Available steps
Source steps
bcl2fastq_source
- Connections:
- Output Connection:
- ‘out/configureBcl2Fastq_log_stderr’
- ‘out/make_log_stderr’
- ‘out/sample_sheet’
- Options:
- adapter-sequence (str, optional)
- adapter-stringency (str, optional)
- fastq-cluster-count (int, optional)
- filter-dir (str, optional)
- flowcell-id (str, optional)
- ignore-missing-bcl (bool, optional)
- ignore-missing-control (bool, optional)
- ignore-missing-stats (bool, optional)
- input-dir (str, required) – file URL
- intensities-dir (str, optional)
- mismatches (int, optional)
- no-eamss (str, optional)
- output-dir (str, optional)
- positions-dir (str, optional)
- positions-format (str, optional)
- sample-sheet (str, required)
- tiles (str, optional)
- use-bases-mask (str, optional) – Conversion mask characters: Y or y: use; N or n: discard; I or i: use for indexing. If not given, the mask will be guessed from the RunInfo.xml file in the run folder. For instance, in a 2x76 indexed paired end run, the mask Y76,I6n,y75n means: “use all 76 bases from the first end, discard the last base of the indexing read, and use only the first 75 bases of the second end”.
- with-failed-reads (str, optional)
Required tools: configureBclToFastq.pl, make, mkdir, mv
This step provides input files which already exist and therefore creates no tasks in the pipeline.
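The mask notation of the use-bases-mask option can be made concrete with a small parser. The following is a minimal sketch, not the logic used by configureBclToFastq.pl: the function name `parse_bases_mask` is hypothetical, a letter without a number is assumed to cover one cycle, and any further mask syntax the tool may accept is not handled.

```python
import re

# Meaning of each mask character, as listed for the use-bases-mask option.
ACTIONS = {"y": "use", "n": "discard", "i": "index"}


def parse_bases_mask(mask):
    """Expand a mask such as 'Y76,I6n,y75n' into per-read (action, cycles) lists."""
    reads = []
    for read_mask in mask.split(","):
        segments = []
        # Each segment is one mask letter followed by an optional repeat count
        # (assumption: a missing count means a single cycle).
        for letter, count in re.findall(r"([YyNnIi])(\d*)", read_mask):
            segments.append((ACTIONS[letter.lower()], int(count) if count else 1))
        reads.append(segments)
    return reads


for number, segments in enumerate(parse_bases_mask("Y76,I6n,y75n"), start=1):
    print(number, segments)
```

For the example mask from the option description, the second read expands to six indexing cycles followed by one discarded cycle, which matches the "discard the last base of the indexing read" reading.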
fastq_source
The FastqSource class acts as a source for FASTQ files. This source creates a run for every sample.
Specify a file name pattern in pattern and define how sample names should be determined from file names by specifying a regular expression in group.
Sample index barcodes may be specified by providing the path to a CSV file containing the columns Sample_ID and Index, or directly by defining a dictionary which maps indices to sample names.
- Connections:
- Output Connection:
- ‘out/first_read’
- ‘out/second_read’
- Options:
- first_read (str, required) – Part of the file name that marks all files containing sequencing data of the first read. Example: ‘R1.fastq’ or ‘_1.fastq’
- group (str, optional) – A regular expression which is applied to found files, and which is used to determine the sample name from the file name. For example, (Sample_\d+)_R[12].fastq.gz, when applied to a file called Sample_1_R1.fastq.gz, would result in a sample name of Sample_1. You can specify multiple capture groups in the regular expression.
- indices (str/dict, optional) – Path to a CSV file or a dictionary of sample_id: barcode entries.
- paired_end (bool, required) – Specify whether the samples are paired end or not.
- pattern (str, optional) – A file name pattern, for example /home/test/fastq/Sample_*.fastq.gz.
- sample_id_prefix (str, optional) – This optional prefix is prepended to every sample name.
- sample_to_files_map (dict/str, optional) – A listing of sample names and their associated files. This must be provided as a YAML dictionary.
- second_read (str, required) – Part of the file name that marks all files containing sequencing data of the second read. Example: ‘R2.fastq’ or ‘_2.fastq’
This step provides input files which already exist and therefore creates no tasks in the pipeline.
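How pattern, group, first_read and second_read interact can be illustrated with the example values from the option descriptions. This is only a sketch under assumptions: the file list is made up, `group_by_sample` is a hypothetical helper, and only the first capture group is used as the sample name, since the way multiple capture groups are combined is not described here.

```python
import fnmatch
import re

# Example option values taken from the descriptions above.
pattern = "/home/test/fastq/Sample_*.fastq.gz"
group = r"(Sample_\d+)_R[12].fastq.gz"

# Made-up file list standing in for the files found on disk.
files = [
    "/home/test/fastq/Sample_1_R1.fastq.gz",
    "/home/test/fastq/Sample_1_R2.fastq.gz",
    "/home/test/fastq/Sample_2_R1.fastq.gz",
    "/home/test/fastq/notes.txt",
]


def group_by_sample(paths, pattern, group, first_read="R1.fastq", second_read="R2.fastq"):
    """Map each sample name to its first-read and second-read files."""
    samples = {}
    for path in paths:
        # Keep only files matching the file name pattern.
        if not fnmatch.fnmatchcase(path, pattern):
            continue
        match = re.search(group, path)
        if match is None:
            continue
        # Assumption: the first capture group is the sample name.
        reads = samples.setdefault(match.group(1), {"first_read": [], "second_read": []})
        if first_read in path:
            reads["first_read"].append(path)
        elif second_read in path:
            reads["second_read"].append(path)
    return samples


for sample, reads in sorted(group_by_sample(files, pattern, group).items()):
    print(sample, reads)
```

In this sketch Sample_2 ends up without a second-read file, which is the kind of inconsistency a paired_end run would need to detect.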
fetch_chrom_sizes_source
- Connections:
- Output Connection:
- ‘out/chromosome_sizes’
- Options:
- path (str, required) – directory to move file to
- ucsc-database (str, required) – Name of UCSC database e.g. hg38, mm9
Required tools: cp, fetchChromSizes
This step provides input files which already exist and therefore creates no tasks in the pipeline.
raw_file_source
- Connections:
- Output Connection:
- ‘out/raw’
- Options:
- group (str, optional) – A regular expression which is applied to found files, and which is used to determine the sample name from the file name. For example, (Sample_\d+)_R[12].fastq.gz, when applied to a file called Sample_1_R1.fastq.gz, would result in a sample name of Sample_1. You can specify multiple capture groups in the regular expression.
- pattern (str, optional) – A file name pattern, for example /home/test/fastq/Sample_*.fastq.gz.
- sample_id_prefix (str, optional) – This optional prefix is prepended to every sample name.
- sample_to_files_map (dict/str, optional) – A listing of sample names and their associated files. This must be provided as a YAML dictionary.
This step provides input files which already exist and therefore creates no tasks in the pipeline.
raw_file_sources
The RawFileSources class acts as a temporary fix to get files into the pipeline. This source creates a run for every sample.
Specify a file name pattern in pattern and define how sample names should be determined from file names by specifying a regular expression in group.
- Connections:
- Output Connection:
- ‘out/raws’
- Options:
- group (str, required) – This is a LEGACY step. Do NOT use it; use the raw_file_source step instead. A regular expression which is applied to found files, and which is used to determine the sample name from the file name. For example, (Sample_\d+)_R[12].fastq.gz, when applied to a file called Sample_1_R1.fastq.gz, would result in a sample name of Sample_1. You can specify multiple capture groups in the regular expression.
- paired_end (bool, required) – Specify whether the samples are paired end or not.
- pattern (str, required) – A file name pattern, for example /home/test/fastq/Sample_*.fastq.gz.
- sample_id_prefix (str, optional) – This optional prefix is prepended to every sample name.
This step provides input files which already exist and therefore creates no tasks in the pipeline.
raw_url_source
- Connections:
- Output Connection:
- ‘out/raw’
- Options:
- filename (str, optional) – local file name of downloaded file
- hashing-algorithm (str, optional) – hashing algorithm to use
- possible values: ‘md5’, ‘sha1’, ‘sha224’, ‘sha256’, ‘sha384’, ‘sha512’
- path (str, required) – directory to move downloaded file to
- secure-hash (str, optional) – expected secure hash of downloaded file
- uncompress (bool, optional) – File is uncompressed after download
- url (str, required) – Download URL
Required tools: compare_secure_hashes, cp, curl, dd, mkdir, pigz
This step provides input files which already exist and therefore creates no tasks in the pipeline.
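The hashing-algorithm and secure-hash options describe a checksum comparison on the downloaded file. The sketch below shows one way such a check could look using Python's hashlib; it is not the implementation of the compare_secure_hashes tool, and the helper names `file_digest` and `matches_secure_hash` are hypothetical.

```python
import hashlib
import os
import tempfile

# Values listed as possible for the hashing-algorithm option.
HASHING_ALGORITHMS = ("md5", "sha1", "sha224", "sha256", "sha384", "sha512")


def file_digest(path, algorithm, chunk_size=256 * 1024):
    """Return the hex digest of a file, reading it in chunks."""
    if algorithm not in HASHING_ALGORITHMS:
        raise ValueError(f"unsupported hashing algorithm: {algorithm}")
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_secure_hash(path, algorithm, expected):
    """Compare the file's digest with the expected secure hash (case-insensitive)."""
    return file_digest(path, algorithm) == expected.strip().lower()


# Demo on a throwaway file standing in for a downloaded file.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(b"abc")
    demo_path = handle.name
print(matches_secure_hash(demo_path, "sha256", hashlib.sha256(b"abc").hexdigest()))
os.remove(demo_path)
```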
raw_url_sources
- Connections:
- Output Connection:
- ‘out/raw’
- Options:
- run-download-info (dict, required) – Dictionary of dictionaries. The keys are the names of the runs. The values are dictionaries whose keys are identical to the options of a ‘raw_url_source’ source step. An example:
    <name>:
        filename: <filename>
        hashing-algorithm: <hashing-algorithm>
        path: <path>
        secure-hash: <secure-hash>
        uncompress: <uncompress>
        url: <url>
Required tools: compare_secure_hashes, cp, curl, dd, mkdir, pigz
This step provides input files which already exist and therefore creates no tasks in the pipeline.
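Because run-download-info nests the raw_url_source options under each run name, a quick structural check can catch mistakes before a pipeline is started. The following sketch assumes the required and optional keys listed for raw_url_source; the function `check_run_download_info`, the run names, and the example.org URLs are all hypothetical.

```python
# Option names and requirements as listed for the raw_url_source step.
REQUIRED = {"path", "url"}
OPTIONAL = {"filename", "hashing-algorithm", "secure-hash", "uncompress"}
HASHING_ALGORITHMS = {"md5", "sha1", "sha224", "sha256", "sha384", "sha512"}


def check_run_download_info(info):
    """Return a list of human-readable problems found in a run-download-info dict."""
    problems = []
    for run, options in info.items():
        keys = set(options)
        for key in sorted(REQUIRED - keys):
            problems.append(f"{run}: missing required option '{key}'")
        for key in sorted(keys - REQUIRED - OPTIONAL):
            problems.append(f"{run}: unknown option '{key}'")
        algorithm = options.get("hashing-algorithm")
        if algorithm is not None and algorithm not in HASHING_ALGORITHMS:
            problems.append(f"{run}: unsupported hashing-algorithm '{algorithm}'")
    return problems


# Hypothetical configuration with one valid and one faulty run.
run_download_info = {
    "genome": {
        "url": "https://example.org/genome.fa.gz",
        "path": "downloads",
        "hashing-algorithm": "md5",
        "uncompress": True,
    },
    "annotation": {"url": "https://example.org/genes.gtf", "hashing-algorithm": "crc32"},
}

for problem in check_run_download_info(run_download_info):
    print(problem)
```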
run_folder_source
This source looks for fastq.gz files in [path]/Unaligned/Project_*/Sample_* and pulls additional information from CSV sample sheets it finds. It also makes sure that index information for all samples is coherent and unambiguous.
- Connections:
- Output Connection:
- ‘out/first_read’
- ‘out/second_read’
- Options:
- first_read (str, required) – Part of the file name that marks all files containing sequencing data of the first read. Example: ‘_R1.fastq’ or ‘_1.fastq’
- default value: _R1
- paired_end (bool, required)
- path (str, required)
- project (str, required)
- default value: *
- second_read (str, required) – Part of the file name that marks all files containing sequencing data of the second read. Example: ‘R2.fastq’ or ‘_2.fastq’
- default value: _R2
This step provides input files which already exist and therefore creates no tasks in the pipeline.
Processing steps
bam_to_bedgraph_and_bigwig
- Connections:
- Input Connection:
- ‘in/alignments’
- Output Connection:
- ‘out/bedgraph’
- ‘out/bigwig’
- Options:
- chromosome-sizes (str, required)
- temp-sort-dir (str, optional)
Required tools: bedGraphToBigWig, bedtools, sort
CPU Cores: 8
bam_to_genome_browser
- Connections:
- Input Connection:
- ‘in/alignments’
- Output Connection:
- ‘out/alignments’
- Options:
- bedtools-bamtobed-color (str, optional)
- bedtools-bamtobed-tag (str, optional)
- bedtools-genomecov-3 (bool, optional)
- bedtools-genomecov-5 (bool, optional)
- bedtools-genomecov-max (int, optional)
- bedtools-genomecov-report-zero-coverage (bool, required)
- bedtools-genomecov-scale (float, optional)
- bedtools-genomecov-split (bool, required)
- default value: True
- bedtools-genomecov-strand (str, optional)
- possible values: ‘+’, ‘-’
- chromosome-sizes (str, required)
- dd-blocksize (str, optional)
- default value: 256k
- output-format (str, required)
- default value: bigWig
- possible values: ‘bed’, ‘bigBed’, ‘bedGraph’, ‘bigWig’
- trackline (dict, optional)
- trackopts (dict, optional)
Required tools: bedGraphToBigWig, bedToBigBed, bedtools, dd, mkfifo, pigz
CPU Cores: 8
bowtie2
Bowtie2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. It is particularly good at aligning reads of about 50 up to 100s or 1,000s of characters, and particularly good at aligning to relatively long (e.g. mammalian) genomes. Bowtie 2 indexes the genome with an FM Index to keep its memory footprint small: for the human genome, its memory footprint is typically around 3.2 GB. Bowtie 2 supports gapped, local, and paired-end alignment modes.
http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
typical command line:
bowtie2 [options]* -x <bt2-idx> {-1 <m1> -2 <m2> | -U <r>} -S [<hit>]
- Connections:
- Input Connection:
- ‘in/first_read’
- ‘in/second_read’
- Output Connection:
- ‘out/alignments’
- Options:
- dd-blocksize (str, optional)
- default value: 256k
- index (str, required) – Path to bowtie2 index (not containing file suffixes).
Required tools: bowtie2, dd, mkfifo, pigz
CPU Cores: 6
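The typical command line distinguishes paired-end input (-1/-2) from unpaired input (-U). As a rough illustration of that choice, the sketch below assembles an argument list from the index option and the two input connections. The helper `bowtie2_command` and the file names are hypothetical, and the dd, mkfifo and pigz plumbing listed under required tools is not modelled.

```python
def bowtie2_command(index, first_read, second_read=None, output="alignments.sam"):
    """Build a bowtie2 argument list for single-end or paired-end input."""
    command = ["bowtie2", "-x", index]
    if second_read is None:
        # Unpaired reads are passed with -U.
        command += ["-U", first_read]
    else:
        # Paired reads are passed as mate 1 (-1) and mate 2 (-2).
        command += ["-1", first_read, "-2", second_read]
    # -S names the SAM output file.
    command += ["-S", output]
    return command


print(" ".join(bowtie2_command("idx/genome", "Sample_1_R1.fastq", "Sample_1_R2.fastq")))
print(" ".join(bowtie2_command("idx/genome", "Sample_2_R1.fastq")))
```

Returning a list instead of a single string keeps the arguments safe to hand to `subprocess.run` without shell quoting.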
bowtie2_generate_index
bowtie2-build builds a Bowtie index from a set of DNA sequences. bowtie2-build outputs a set of 6 files with suffixes .1.bt2, .2.bt2, .3.bt2, .4.bt2, .rev.1.bt2, and .rev.2.bt2. In the case of a large index these suffixes will have a bt2l termination. These files together constitute the index: they are all that is needed to align reads to that reference. The original sequence FASTA files are no longer used by Bowtie 2 once the index is built.
http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#the-bowtie2-build-indexer
typical command line:
bowtie2-build [options]* <reference_in> <bt2_index_base>
- Connections:
- Input Connection:
- ‘in/reference_sequence’
- Output Connection:
- ‘out/bowtie_index’
- Options:
- bmax (int, optional) – The maximum number of suffixes allowed in a block. Allowing more suffixes per block makes indexing faster, but increases peak memory usage. Setting this option overrides any previous setting for --bmax, or --bmaxdivn. Default (in terms of the --bmaxdivn parameter) is --bmaxdivn 4. This is configured automatically by default; use -a/--noauto to configure manually.
- bmaxdivn (int, optional) – The maximum number of suffixes allowed in a block, expressed as a fraction of the length of the reference. Setting this option overrides any previous setting for --bmax, or --bmaxdivn. Default: --bmaxdivn 4. This is configured automatically by default; use -a/--noauto to configure manually.
- cutoff (int, optional) – Index only the first <int> bases of the reference sequences (cumulative across sequences) and ignore the rest.
- dcv (int, optional) – Use <int> as the period for the difference-cover sample. A larger period yields less memory overhead, but may make suffix sorting slower, especially if repeats are present. Must be a power of 2 no greater than 4096. Default: 1024. This is configured automatically by default; use -a/--noauto to configure manually.
- dd-blocksize (str, optional)
- default value: 256k
- ftabchars (int, optional) – The ftab is the lookup table used to calculate an initial Burrows-Wheeler range with respect to the first <int> characters of the query. A larger <int> yields a larger lookup table but faster query times. The ftab has size 4^(<int>+1) bytes. The default setting is 10 (ftab is 4MB).
- index-basename (str, required) – Base name used for the bowtie2 index.
- large-index (bool, optional) – Force bowtie2-build to build a large index, even if the reference is less than ~ 4 billion nucleotides long.
- noauto (bool, optional) – Disable the default behavior whereby bowtie2-build automatically selects values for the --bmax, --dcv and --packed parameters according to available memory. Instead, user may specify values for those parameters. If memory is exhausted during indexing, an error message will be printed; it is up to the user to try new parameters.
- nodc (bool, optional) – Disable use of the difference-cover sample. Suffix sorting becomes quadratic-time in the worst case (where the worst case is an extremely repetitive reference). Default: off.
- offrate (int, optional) – To map alignments back to positions on the reference sequences, it’s necessary to annotate (‘mark’) some or all of the Burrows-Wheeler rows with their corresponding location on the genome. -o/--offrate governs how many rows get marked: the indexer will mark every 2^<int> rows. Marking more rows makes reference-position lookups faster, but requires more memory to hold the annotations at runtime. The default is 5 (every 32nd row is marked; for human genome, annotations occupy about 340 megabytes).
- packed (bool, optional) – Use a packed (2-bits-per-nucleotide) representation for DNA strings. This saves memory but makes indexing 2-3 times slower. Default: off. This is configured automatically by default; use -a/--noauto to configure manually.
- seed (int, optional) – Use <int> as the seed for pseudo-random number generator.
Required tools: bowtie2-build, dd, pigz
CPU Cores: 6
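The ftabchars and offrate descriptions contain two small formulas, 4^(<int>+1) bytes for the lookup table and one marked row every 2^<int> rows. The following sketch simply evaluates them for the stated defaults and lists the index file suffixes named in the step description; the function names are hypothetical.

```python
def ftab_bytes(ftabchars):
    """Size of the ftab lookup table in bytes: 4^(ftabchars + 1)."""
    return 4 ** (ftabchars + 1)


def marked_row_interval(offrate):
    """The indexer marks every 2^offrate-th Burrows-Wheeler row."""
    return 2 ** offrate


def index_suffixes(large_index=False):
    """File suffixes of the six index files; large indexes end in bt2l."""
    ending = "bt2l" if large_index else "bt2"
    return [f".{part}.{ending}" for part in ("1", "2", "3", "4", "rev.1", "rev.2")]


print(ftab_bytes(10) / 2 ** 20, "MiB")   # default ftabchars of 10
print(marked_row_interval(5), "rows")    # default offrate of 5
print(index_suffixes(large_index=True))
```

With the defaults this reproduces the figures given in the option descriptions: a 4 MB ftab and every 32nd row marked.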
bwa_backtrack
bwa-backtrack is the bwa algorithm designed for Illumina sequence reads up to 100bp. The computation of the alignments is done by running ‘bwa aln’ first, to align the reads, followed by running ‘bwa samse’ or ‘bwa sampe’ afterwards to generate the final SAM output.
http://bio-bwa.sourceforge.net/
typical command line for single-end data:
bwa aln <bwa-index> <first-read.fastq> > <first-read.sai>
bwa samse <bwa-index> <first-read.sai> <first-read.fastq> > <sam-output>
typical command line for paired-end data:
bwa aln <bwa-index> <first-read.fastq> > <first-read.sai>
bwa aln <bwa-index> <second-read.fastq> > <second-read.sai>
bwa sampe <bwa-index> <first-read.sai> <second-read.sai> <first-read.fastq> <second-read.fastq> > <sam-output>
- Connections:
- Input Connection:
- ‘in/first_read’
- ‘in/second_read’
- Output Connection:
- ‘out/alignments’
- Options:
- aln-0 (bool, optional) – When aln-b is specified, only use single-end reads in mapping.
- aln-1 (bool, optional) – When aln-b is specified, only use the first read in a read pair in mapping (skip single-end reads and the second reads).
- aln-2 (bool, optional) – When aln-b is specified, only use the second read in a read pair in mapping.
- aln-B (int, optional) – Length of barcode starting from the 5’-end. When INT is positive, the barcode of each read will be trimmed before mapping and will be written at the BC SAM tag. For paired-end reads, the barcode from both ends are concatenated. [0]
- aln-E (int, optional) – Gap extension penalty [4]
- aln-I (bool, optional) – The input is in the Illumina 1.3+ read format (quality equals ASCII-64).
- aln-M (int, optional) – Mismatch penalty. BWA will not search for suboptimal hits with a score lower than (bestScore-misMsc). [3]
- aln-N (bool, optional) – Disable iterative search. All hits with no more than maxDiff differences will be found. This mode is much slower than the default.
- aln-O (int, optional) – Gap open penalty [11]
- aln-R (int, optional) – Proceed with suboptimal alignments if there are no more than INT equally best hits. This option only affects paired-end mapping. Increasing this threshold helps to improve the pairing accuracy at the cost of speed, especially for short reads (~32bp).
- aln-b (bool, optional) – Specify that the input read sequence file is in the BAM format. For paired-end data, two ends in a pair must be grouped together and options aln-1 or aln-2 are usually applied to specify which end should be mapped. Typical command lines for mapping paired-end data in the BAM format are:
bwa aln ref.fa -b1 reads.bam > 1.sai
bwa aln ref.fa -b2 reads.bam > 2.sai
bwa sampe ref.fa 1.sai 2.sai reads.bam reads.bam > aln.sam
- aln-c (bool, optional) – Reverse query but not complement it, which is required for alignment in the color space. (Disabled since 0.6.x)
- aln-d (int, optional) – Disallow a long deletion within INT bp towards the 3’-end [16]
- aln-e (int, optional) – Maximum number of gap extensions, -1 for k-difference mode (disallowing long gaps) [-1]
- aln-i (int, optional) – Disallow an indel within INT bp towards the ends [5]
- aln-k (int, optional) – Maximum edit distance in the seed [2]
- aln-l (int, optional) – Take the first INT subsequence as seed. If INT is larger than the query sequence, seeding will be disabled. For long reads, this option is typically ranged from 25 to 35 for ‘-k 2’. [inf]
- aln-n (float, optional) – Maximum edit distance if the value is INT, or the fraction of missing alignments given 2% uniform base error rate if FLOAT. In the latter case, the maximum edit distance is automatically chosen for different read lengths. [0.04]
- aln-o (int, optional) – Maximum number of gap opens [1]
- aln-q (int, optional) – Parameter for read trimming. BWA trims a read down to argmax_x{sum_{i=x+1}^l(INT-q_i)} if q_l<INT where l is the original read length. [0]
- aln-t (int, optional) – Number of threads (multi-threading mode) [1]
- default value: 6
- dd-blocksize (str, optional)
- default value: 256k
- index (str, required) – Path to BWA index
- sampe-N (int, optional) – Maximum number of alignments to output in the XA tag for disconcordant read pairs (excluding singletons). If a read has more than INT hits, the XA tag will not be written. [10]
- sampe-P (bool, optional) – Load the entire FM-index into memory to reduce disk operations (base-space reads only). With this option, at least 1.25N bytes of memory are required, where N is the length of the genome.
- sampe-a (int, optional) – Maximum insert size for a read pair to be considered being mapped properly. Since 0.4.5, this option is only used when there are not enough good alignments to infer the distribution of insert sizes. [500]
- sampe-n (int, optional) – Maximum number of alignments to output in the XA tag for reads paired properly. If a read has more than INT hits, the XA tag will not be written. [3]
- sampe-o (int, optional) – Maximum occurrences of a read for pairing. A read with more occurrences will be treated as a single-end read. Reducing this parameter helps faster pairing. [100000]
- sampe-r (str, optional) – Specify the read group in a format like ‘@RG\tID:foo\tSM:bar’. [null]
- samse-n (int, optional) – Maximum number of alignments to output in the XA tag for reads paired properly. If a read has more than INT hits, the XA tag will not be written. [3]
- samse-r (str, optional) – Specify the read group in a format like ‘@RG\tID:foo\tSM:bar’. [null]
Required tools: bwa, dd, mkfifo, pigz
CPU Cores: 8
bwa_generate_index
This step generates the index database from sequences in the FASTA format.
Typical command line:
bwa index -p <index-basename> <sequence.fasta>
- Connections:
- Input Connection:
- ‘in/reference_sequence’
- Output Connection:
- ‘out/bwa_index’
- Options:
- index-basename (str, required) – Prefix of the created index database
Required tools: bwa
CPU Cores: 6
bwa_mem
Align 70bp-1Mbp query sequences with the BWA-MEM algorithm. Briefly, the algorithm works by seeding alignments with maximal exact matches (MEMs) and then extending seeds with the affine-gap Smith-Waterman algorithm (SW).
http://bio-bwa.sourceforge.net/bwa.shtml
Typical command line:
bwa mem [options] <bwa-index> <first-read.fastq> [<second-read.fastq>] > <sam-output>
- Connections:
- Input Connection:
- ‘in/first_read’
- ‘in/second_read’
- Output Connection:
- ‘out/alignments’
- Options:
- A (int, optional) – score for a sequence match, which scales options -TdBOELU unless overridden [1]
- B (int, optional) – penalty for a mismatch [4]
- C (bool, optional) – append FASTA/FASTQ comment to SAM output
- D (float, optional) – drop chains shorter than FLOAT fraction of the longest overlapping chain [0.50]
- E (str, optional) – gap extension penalty; a gap of size k cost ‘{-O} + {-E}*k’ [1,1]
- H (str, optional) – insert STR to header if it starts with @; or insert lines in FILE [null]
- L (str, optional) – penalty for 5’- and 3’-end clipping [5,5]
- M (str, optional) – mark shorter split hits as secondary
- O (str, optional) – gap open penalties for deletions and insertions [6,6]
- P (bool, optional) – skip pairing; mate rescue performed unless -S also in use
- R (str, optional) – read group header line such as ‘@RG\tID:foo\tSM:bar’ [null]
- S (bool, optional) – skip mate rescue
- T (int, optional) – minimum score to output [30]
- U (int, optional) – penalty for an unpaired read pair [17]
- V (bool, optional) – output the reference FASTA header in the XR tag
- W (int, optional) – discard a chain if seeded bases shorter than INT [0]
- Y (str, optional) – use soft clipping for supplementary alignments
- a (bool, optional) – output all alignments for SE or unpaired PE
- c (int, optional) – skip seeds with more than INT occurrences [500]
- d (int, optional) – off-diagonal X-dropoff [100]
- dd-blocksize (str, optional)
- default value: 256k
- e (bool, optional) – discard full-length exact matches
- h (str, optional) – if there are <INT hits with score >80% of the max score, output all in XA [5,200]
- index (str, required) – Path to BWA index
- j (bool, optional) – treat ALT contigs as part of the primary assembly (i.e. ignore <idxbase>.alt file)
- k (int, optional) – minimum seed length [19]
- m (int, optional) – perform at most INT rounds of mate rescues for each read [50]
- p (bool, optional) – smart pairing (ignoring in2.fq)
- r (float, optional) – look for internal seeds inside a seed longer than {-k} * FLOAT [1.5]
- t (int, optional) – number of threads [6]
- default value: 6
- v (int, optional) – verbose level: 1=error, 2=warning, 3=message, 4+=debugging [3]
- w (int, optional) – band width for banded alignment [100]
- x (str, optional) – read type. Setting -x changes multiple parameters unless overridden [null]
pacbio: -k17 -W40 -r10 -A1 -B1 -O1 -E1 -L0 (PacBio reads to ref)
ont2d: -k14 -W20 -r10 -A1 -B1 -O1 -E1 -L0 (Oxford Nanopore 2D-reads to ref)
intractg: -B9 -O16 -L5 (intra-species contigs to ref)
- y (int, optional) – seed occurrence for the 3rd round seeding [20]
Required tools: bwa, dd, mkfifo, pigz
CPU Cores: 6
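The E option states that a gap of size k costs ‘{-O} + {-E}*k’, and both O and E take a pair of values for deletions and insertions. The sketch below evaluates that formula with the listed defaults. The helper names are hypothetical, and treating a single number as applying to both deletions and insertions is an assumption of this sketch.

```python
def parse_penalty_pair(value):
    """Split 'DEL,INS' into two ints; a single number is used for both (assumption)."""
    parts = [int(part) for part in str(value).split(",")]
    if len(parts) == 1:
        parts = parts * 2
    return parts[0], parts[1]


def gap_cost(k, gap_open=6, gap_extend=1):
    """Cost of a gap of size k: {-O} + {-E} * k."""
    return gap_open + gap_extend * k


open_del, open_ins = parse_penalty_pair("6,6")    # default of option O
ext_del, ext_ins = parse_penalty_pair("1,1")      # default of option E
for k in (1, 3, 10):
    print(k, gap_cost(k, open_del, ext_del), gap_cost(k, open_ins, ext_ins))
```

With the defaults a 3-base deletion costs 6 + 1*3 = 9, more than two mismatches at the default mismatch penalty of 4.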
chromhmm_binarizebam
This command converts coordinates of aligned reads into binarized data form from which a chromatin state model can be learned. The binarization is based on a poisson background model. If no control data is specified the parameter to the poisson distribution is the global average number of reads per bin. If control data is specified the global average number of reads is multiplied by the local enrichment for control reads as determined by the specified parameters. Optionally intermediate signal files can also be outputted and these signal files can later be directly converted into binary form using the BinarizeSignal command.
- Connections:
- Input Connection:
- ‘in/alignments’
- Output Connection:
- ‘out/cellmarkfiletable’
- ‘out/chromhmm_binarization’
- Options:
- b (int, optional) – The number of base pairs in a bin determining the resolution of the model learning and segmentation. By default this parameter value is set to 200 base pairs.
- cell_mark_files (dict, required) – A dictionary where the keys are the names of the runs and the values are lists of lists. The lists of lists describe the content of a ‘cellmarkfiletable’ file as used by ‘BinarizeBam’, but use the run ID instead of a file name for the mark and control on each line. A ‘cellmarkfiletable’ is a tab delimited file where each row contains the cell type or other identifier for a group of marks, then the associated mark, then the name of a BAM file, and optionally a corresponding control BAM file. If a mark is missing in one cell type, but not others it will receive a 2 for all entries in the binarization file and -1 in the signal file. If the same cell and mark combination appears on multiple lines, then the union of all the reads across entries is taken except for control data where each unique file is only counted once.
- center (bool, optional) – If this flag is present then the center of the interval is used to determine the bin to assign a read. This can make sense to use if the coordinates are based on already extended reads. If this option is selected, then the strand information of a read and the shift parameter are ignored. By default reads are assigned to a bin based on the position of its 5’ end as determined from the strand of the read after shifting an amount determined by the -n shift option.
- chrom_sizes_file (str, required)
- e (int, optional) – Specifies the amount that should be subtracted from the end coordinate of a read so that both coordinates are inclusive and 0 based. The default value is 1 corresponding to standard bed convention of the end interval being 0-based but not inclusive.
- f (int, optional) – This indicates a threshold for the fold enrichment over expected that must be met or exceeded by the observed count in a bin for a present call. The expectation is determined in the same way as the mean parameter for the poisson distribution in terms of being based on a uniform background unless control data is specified. This parameter can be useful when dealing with very deeply and/or unevenly sequenced data. By default this parameter value is 0 meaning effectively it is not used.
- g (int, optional) – This indicates a threshold for the signal that must be met or exceeded by the observed count in a bin for a present call. This parameter can be useful when desiring to directly place a threshold on the signal. By default this parameter value is 0 meaning effectively it is not used.
- n (int, optional) – The number of bases a read should be shifted to determine a bin assignment. Bin assignment is based on the 5’ end of a read shifted this amount with respect to the strand orientation. By default this value is 100.
- p (float, optional) – This option specifies the tail probability of the poisson distribution that the binarization threshold should correspond to. The default value of this parameter is 0.0001.
- s (int, optional) – The amount that should be subtracted from the interval start coordinate so the interval is inclusive and 0 based. Default is 0 corresponding to the standard bed convention.
- strictthresh (bool, optional) – If this flag is present then the poisson threshold must be strictly greater than the tail probability, otherwise by default the largest integer count for which the tail includes the poisson threshold probability is used.
- u (int, optional) – An integer pseudocount that is uniformly added to every bin in the control data in order to smooth the control data from 0. The default value is 1.
- w (int, optional) – This determines the extent of the spatial smoothing in computing the local enrichment for control reads. The local enrichment for control signal in the x-th bin on the chromosome after adding pseudocountcontrol is computed based on the average control counts for all bins within x-w and x+w. If no controldir is specified, then this option is ignored. The default value is 5.
Required tools: ChromHMM, ln, ls, mkdir, printf, tar, xargs
CPU Cores: 4
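The options b, n, s, e and center together decide which bin a read is counted in. The sketch below is one reading of those descriptions and is not taken from ChromHMM's source: the function `assign_bin` is hypothetical, the shift is applied downstream of the 5’ end on the read's strand, and the interval midpoint is rounded down when center is set.

```python
def assign_bin(start, end, strand, bin_size=200, shift=100,
               start_offset=0, end_offset=1, center=False):
    """Return the 0-based bin index for a read, following the option descriptions."""
    # Options s and e convert the coordinates to inclusive, 0-based positions.
    first = start - start_offset
    last = end - end_offset
    if center:
        # Strand and shift are ignored; the interval midpoint is used.
        position = (first + last) // 2
    elif strand == "+":
        # The 5' end of a forward read is its first base; shift downstream.
        position = first + shift
    else:
        # The 5' end of a reverse read is its last base; shift downstream.
        position = last - shift
    return position // bin_size


print(assign_bin(1000, 1050, "+"))
print(assign_bin(1000, 1050, "-"))
print(assign_bin(1000, 1050, "+", center=True))
```

For a read covering 1000 to 1050 in bed convention, the forward read is counted at position 1100 (bin 5) and the reverse read at position 949 (bin 4) under the default settings.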
chromhmm_learnmodel
This command takes a directory with a set of binarized data files and learns a chromatin state model. Binarized data files have “_binary” in the file name. In each binarized data file, the first line contains the name of the cell and the name of the chromosome, separated by a tab. The second line contains the name of each mark in tab delimited form. Each remaining line corresponds to a consecutive bin on the chromosome and contains, in tab delimited form, one value per mark: “1” for a present call, “0” for an absent call, and “2” if the data is considered missing at that interval for the mark.
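The layout of a binarized data file can be illustrated with a short parser. This is a minimal sketch based only on the description above; the function `parse_binarized` is hypothetical and the cell, chromosome, and mark names in the example are made up.

```python
def parse_binarized(text):
    """Parse the text of one binarized data file into a dictionary."""
    lines = text.strip().splitlines()
    # Line 1: cell name and chromosome name, separated by a tab.
    cell, chromosome = lines[0].split("\t")
    # Line 2: one mark name per column.
    marks = lines[1].split("\t")
    # Remaining lines: one row per consecutive bin, one value per mark
    # (1 = present, 0 = absent, 2 = missing).
    bins = [[int(value) for value in line.split("\t")] for line in lines[2:]]
    return {"cell": cell, "chromosome": chromosome, "marks": marks, "bins": bins}


example = "cellA\tchr1\nH3K4me3\tH3K27ac\n0\t1\n1\t1\n2\t0\n"
parsed = parse_binarized(example)
print(parsed["cell"], parsed["chromosome"], parsed["marks"])
print(parsed["bins"])
```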
- Connections:
- Input Connection:
- ‘in/cellmarkfiletable’
- ‘in/chromhmm_binarization’
- Output Connection:
- ‘out/chromhmm_model’
- Options:
- assembly (str, required) – Specifies the genome assembly. Overlap and neighborhood enrichments will be called with default parameters using this genome assembly. Assembly names are e.g. hg18, hg19, GRCh38
- b (int, optional) – The number of base pairs in a bin determining the resolution of the model learning and segmentation. By default this parameter value is set to 200 base pairs.
- color (str, optional) – This specifies the color of the heat map. “r,g,b” are integer values between 0 and 255 separated by commas. By default this parameter value is 0,0,255 corresponding to blue.
- d (float, optional) – The threshold on the change on the estimated log likelihood that if it falls below this value, then parameter training will terminate. If this value is less than 0 then it is not used as part of the stopping criteria. The default value for this parameter is 0.001.
- e (float, optional) – This parameter is only applicable if the load option is selected for the init parameter. This parameter controls the smoothing away from 0 when loading a model. The emission value used in the model initialization is a weighted average of the value in the file and a uniform probability over the two possible emissions. The value in the file gets weight (1-loadsmoothemission) while uniform gets weight loadsmoothemission. The default value of this parameter is 0.02.
- h (float, optional) – A smoothing constant away from 0 for all parameters in the information based initialization. This option is ignored if random or load are selected for the initialization method. The default value of this parameter is 0.02.
- holdcolumnorder (bool, optional) – Including this flag suppresses the reordering of the mark columns in the emission parameter table display.
- init (str, optional) – This specifies the method for parameter initialization. ‘information’ is the default method described in (Ernst and Kellis, Nature Methods 2012). ‘random’ randomly initializes the parameters from a uniform distribution. ‘load’ loads the parameters specified in ‘-m modelinitialfile’ and smooths them based on the value of the ‘loadsmoothemission’ and ‘loadsmoothtransition’ parameters. The default is information.
- possible values: ‘information’, ‘random’, ‘load’
- l (str, optional) – This file specifies the length of the chromosomes. It is a two column tab delimited file with the first column specifying the chromosome name and the second column the length. If this file is provided then no end coordinate will exceed what is specified in this file. By default BinarizeBed excludes the last partial bin along the chromosome, but if that is included in the binarized data input files then this file should be included to give a valid end coordinate for the last interval.
- m (str, optional) – This specifies the model file containing the initial parameters which can then be used with the load option
- nobed (bool, optional) – If this flag is present, then this suppresses the printing of segmentation information in the four column format. The default is to generate a four column segmentation file
- nobrowser (bool, optional) – If this flag is present, then browser files are not printed. If -nobed is requested then browserfile writing is also suppressed.
- noenrich (bool, optional) – If this flag is present, then enrichment files are not printed. If -nobed is requested then enrichment file writing is also suppressed.
- numstates (int, required)
- r (int, optional) – This option specifies the maximum number of iterations over all the input data in the training. By default this is set to 200.
- s (int, optional) – This allows the specification of the random seed. Randomization is used to determine the visit order of chromosomes in the incremental expectation-maximization algorithm used to train the parameters and also used to generate the initial values of the parameters if random is specified for the init method.
- stateordering (str, optional) – This determines whether the states are ordered based on the emission or transition parameters. See (Ernst and Kellis, Nature Methods) for details. Default is ‘emission’.
- possible values: ‘emission’, ‘transition’
- t (float, optional) – This parameter is only applicable if the load option is selected for the init parameter. This parameter controls the smoothing away from 0 when loading a model. The transition value used in the model initialization is a weighted average of the value in the file and a uniform probability over the transitions. The value in the file gets weight (1-loadsmoothtransition) while uniform gets weight loadsmoothtransition. The default value is 0.5.
- x (int, optional) – This parameter specifies the maximum number of seconds that can be spent optimizing the model parameters. If it is less than 0, then there is no limit and termination is based on maximum number of iterations or a log likelihood change criteria. The default value of this parameter is -1.
- z (int, optional) – This parameter determines the threshold at which to set extremely low transition probabilities to 0 during training. Setting extremely low transition probabilities to 0 makes model learning more efficient with essentially no impact on the final results. If a transition probability falls below 10^-zerotransitionpower during training it is set to 0. Making this parameter too low and thus the cutoff too high can potentially cause some numerical instability. By default this parameter is set to 8.
Required tools: ChromHMM, ls, mkdir, rm, tar, xargs
CPU Cores: 8
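The load smoothing described for e and t is a plain weighted average, and z defines a cutoff of 10^-z. The following is a minimal sketch of that arithmetic only. The function names are illustrative and not part of ChromHMM, and the uniform probability is assumed to be 1/n over n possible outcomes (n = 2 for emissions).

```python
def smooth_loaded_value(file_value, weight, n_outcomes):
    """Weighted average of a loaded probability and a uniform probability.

    weight plays the role of loadsmoothemission (e) or
    loadsmoothtransition (t); n_outcomes is an assumption about how
    the uniform probability is defined.
    """
    uniform = 1.0 / n_outcomes
    return (1.0 - weight) * file_value + weight * uniform


def zero_small_transition(probability, zerotransitionpower=8):
    """Set a transition probability below 10^-zerotransitionpower to 0."""
    cutoff = 10.0 ** (-zerotransitionpower)
    return 0.0 if probability < cutoff else probability


# An emission loaded as 0.0, smoothed with the default e = 0.02
# over the two possible emissions, is moved away from 0.
print(smooth_loaded_value(0.0, 0.02, 2))

# A transition probability far below the default cutoff of 1e-8.
print(zero_small_transition(1e-10))
```

The sketch does not renormalize the remaining probabilities after a value is set to 0; whether and how ChromHMM does so is not covered here.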
cuffcompare¶
CuffCompare is part of the ‘Cufflinks suite of tools’ for differential expression analysis of RNA-Seq data and their visualisation. This step compares a cufflinks assembly to a known annotation. For details about cuffcompare we refer to the author’s webpage:
- Connections:
- Input Connection:
- ‘in/features’
- Output Connection:
- ‘out/features’
- ‘out/loci’
- ‘out/log_stderr’
- ‘out/stats’
- ‘out/tracking’
- Options:
- ref-gtf (str, optional) – A “reference” annotation GTF. The input assemblies are merged together with the reference GTF and included in the final output.
Required tools: cuffcompare
CPU Cores: 1
cufflinks¶
CuffLinks is part of the ‘Cufflinks suite of tools’ for differential expression analysis of RNA-Seq data and their visualisation. This step applies the cufflinks tool, which assembles transcriptomes from RNA-Seq data, quantifies their expression, and produces .gtf files with these annotations. For details on cufflinks we refer to the author’s webpage:
- Connections:
- Input Connection:
- ‘in/alignments’
- Output Connection:
- ‘out/features’
- ‘out/genes-fpkm’
- ‘out/isoforms_fpkm’
- ‘out/log_stderr’
- ‘out/skipped’
- Options:
- 3-overhang-tolerance (int, optional) – overhang allowed on 3’ end when merging with reference
- GTF (bool, optional) – quantitate against reference transcript annotations
- GTF-guide (bool, optional) – use reference transcript annotation to guide assembly
- compatible-hits-norm (bool, optional) – count hits compatible with reference RNAs only
- frag-bias-correct (str, optional) – use bias correction - reference fasta required
- frag-len-mean (int, optional) – average fragment length (unpaired reads only)
- frag-len-std-dev (int, optional) – fragment length std deviation (unpaired reads only)
- intron-overhang-tolerance (int, optional) – overhang allowed inside reference intron when merging
- junc-alpha (float, optional) – alpha for junction binomial test filter
- label (str, optional) – assembled transcripts have this ID prefix
- library-norm-method (str, optional) – Method used to normalize library sizes
- possible values: ‘classic-fpkm’
- library-type (str, required) – library prep used for input reads
- possible values: ‘ff-firststrand’, ‘ff-secondstrand’, ‘ff-unstranded’, ‘fr-firststrand’, ‘fr-secondstrand’, ‘fr-unstranded’, ‘transfrags’
- mask-file (str, optional) – ignore all alignment within transcripts in this file
- max-bundle-frags (int, optional) – maximum fragments allowed in a bundle before skipping
- max-bundle-length (int, optional) – maximum genomic length allowed for a given bundle
- max-frag-multihits (str, optional) – Maximum number of alignments allowed per fragment
- max-intron-length (int, optional) – ignore alignments with gaps longer than this
- max-mle-iterations (int, optional) – maximum iterations allowed for MLE calculation
- max-multiread-fraction (float, optional) – maximum fraction of allowed multireads per transcript
- min-frags-per-transfrag (int, optional) – minimum number of fragments needed for new transfrags
- min-intron-length (int, optional) – minimum intron size allowed in genome
- min-isoform-fraction (float, optional) – suppress transcripts below this abundance level
- multi-read-correct (bool, optional) – use ‘rescue method’ for multi-reads (more accurate)
- no-effective-length-correction (bool, optional) – No effective length correction
- no-faux-reads (bool, optional) – disable tiling by faux reads
- no-length-correction (bool, optional) – No length correction
- no-update-check (bool, optional) – do not contact server to check for update availability
- num-frag-assign-draws (int, optional) – Number of fragment assignment samples per generation
- num-frag-count-draws (int, optional) – Number of fragment generation samples
- num-threads (int, optional) – number of threads used during analysis
- overhang-tolerance (int, optional) – number of terminal exon bp to tolerate in introns
- overlap-radius (int, optional) – maximum gap size to fill between transfrags (in bp)
- pre-mrna-fraction (float, optional) – suppress intra-intronic transcripts below this level
- seed (int, optional) – value of random number generator seed
- small-anchor-fraction (float, optional) – percent read overhang taken as ‘suspiciously small’
- total-hits-norm (bool, optional) – count all hits for normalization
- trim-3-avgcov-thresh (int, optional) – minimum avg coverage required to attempt 3’ trimming
- trim-3-dropoff-frac (float, optional) – fraction of avg coverage below which to trim 3’ end
- verbose (bool, optional) – log-friendly verbose processing (no progress bar)
Required tools: cufflinks, mkdir, mv
CPU Cores: 6
cuffmerge¶
CuffMerge is part of the ‘Cufflinks suite of tools’ for differential expression analysis of RNA-Seq data and their visualisation. This step applies the cuffmerge tool, which merges several Cufflinks assemblies. For details on cuffmerge we refer to the author’s webpage:
- Connections:
- Input Connection:
- ‘in/features’
- Output Connection:
- ‘out/assemblies’
- ‘out/features’
- ‘out/log_stderr’
- ‘out/run_log’
- Options:
- num-threads (int, optional) – Use this many threads to merge assemblies.
- default value: 6
- ref-gtf (str, optional) – A “reference” annotation GTF. The input assemblies are merged together with the reference GTF and included in the final output.
- ref-sequence (str, optional) – This argument should point to the genomic DNA sequences for the reference. If a directory, it should contain one fasta file per contig. If a multifasta file, all contigs should be present.
- run_id (str, optional) – An arbitrary name of the new run (which is a merge of all samples).
- default value: magic
Required tools: cuffmerge, mkdir, mv, printf
CPU Cores: 6
cutadapt¶
Cutadapt finds and removes adapter sequences, primers, poly-A tails and other types of unwanted sequence from your high-throughput sequencing reads.
- Connections:
- Input Connection:
- ‘in/first_read’
- ‘in/second_read’
- Output Connection:
- ‘out/first_read’
- ‘out/log_first_read’
- ‘out/log_second_read’
- ‘out/second_read’
- Options:
- adapter-R1 (str, optional) – Adapter sequence to be clipped off of the first read.
- adapter-R2 (str, optional) – Adapter sequence to be clipped off of the second read.
- adapter-file (str, optional) – File containing adapter sequences to be clipped off of the reads.
- adapter-type (str, optional) – a: 3’ adapter, b: 3’ or 5’ adapter, g: 5’ adapter
- default value: -a
- possible values: ‘-a’, ‘-g’, ‘-b’
- dd-blocksize (str, optional)
- default value: 256k
- fix_qnames (bool, required) – If set to true, only the leftmost string without spaces of the QNAME field of the FASTQ data is kept. This might be necessary for downstream analysis.
- use_reverse_complement (bool, required) – If set to true, the reverse complements of the adapter sequences ‘adapter-R1’ and ‘adapter-R2’ are used for adapter clipping.
Required tools: cat, cutadapt, dd, fix_qnames, mkfifo, pigz
CPU Cores: 4
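The effect of the use_reverse_complement and fix_qnames options can be illustrated with a short sketch. This is not the code of the step or of the fix_qnames tool; the function names are hypothetical, and only the bases A, C, G, T and N are handled.

```python
# Complement table for the four DNA bases plus N
# (other IUPAC ambiguity codes are left unchanged).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(_COMPLEMENT)[::-1]


def fix_qname(header_line):
    """Keep only the leftmost whitespace-free part of a FASTQ header line."""
    return header_line.split()[0]


print(reverse_complement("AGATCGGAAGAGC"))         # GCTCTTCCGATCT
print(fix_qname("@SEQ_ID:1:1101 1:N:0:ATCACG"))    # @SEQ_ID:1:1101
```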
discardLargeSplitsAndPairs¶
discardLargeSplitsAndPairs reads SAM formatted alignments of the mapped reads. It discards all split reads that skip more than N_splits nucleotides in their alignment to the reference genome. In addition, all read pairs that are mapped to distant regions such that the final template would exceed M_mates nucleotides are also discarded. All remaining reads are returned in SAM format. The discarded reads are also collected in a SAM formatted file and a statistic is returned.
- Connections:
- Input Connection:
- ‘in/alignments’
- Output Connection:
- ‘out/alignments’
- ‘out/log’
- ‘out/stats’
- Options:
- M_mates (str, required) – Size of template (in nucleotides) that would arise from a read pair. Read pairs that exceed this value are discarded.
- N_splits (str, required) – Size of the skipped region within a split read (in nucleotides). Split Reads that skip more nt than this value are discarded.
Required tools: dd, discardLargeSplitsAndPairs, pigz, samtools
CPU Cores: 4
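The two filter criteria can be sketched on single SAM lines as follows. This is an illustration under stated assumptions, not the implementation of the discardLargeSplitsAndPairs tool: the skipped region is taken to be the largest single N operation in the CIGAR string, the template size is taken from the TLEN column, and mates are not discarded together. The function names are hypothetical.

```python
import re

# One CIGAR operation: a length followed by an operation code.
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def largest_skip(cigar):
    """Return the longest N (skipped region) operation in a CIGAR string, or 0."""
    return max(
        (int(length) for length, op in _CIGAR_OP.findall(cigar) if op == "N"),
        default=0,
    )


def keep_alignment(sam_line, n_splits, m_mates):
    """Return True if an alignment line passes both size filters."""
    fields = sam_line.rstrip("\n").split("\t")
    cigar = fields[5]        # column 6: CIGAR
    tlen = int(fields[8])    # column 9: observed template length
    if largest_skip(cigar) > n_splits:
        return False
    if abs(tlen) > m_mates:
        return False
    return True


line = "r2\t0\tchr1\t100\t60\t20M5000N30M\t*\t0\t0\tACGT\tFFFF"
print(keep_alignment(line, n_splits=1000, m_mates=100000))  # False
```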
fastqc¶
The fastqc step is a wrapper for the fastqc tool. It generates some quality metrics for fastq files. For this specific instance only the zip archive is preserved.
- Connections:
- Input Connection:
- ‘in/first_read’
- ‘in/second_read’
- Output Connection:
- ‘out/first_read_fastqc_report’
- ‘out/first_read_fastqc_report_webpage’
- ‘out/first_read_log_stderr’
- ‘out/second_read_fastqc_report’
- ‘out/second_read_fastqc_report_webpage’
- ‘out/second_read_log_stderr’
Required tools: fastqc, mkdir, mv
CPU Cores: 1
fastx_quality_stats¶
fastx_quality_stats generates a text file containing quality information of the input FASTQ data.
Documentation:
http://hannonlab.cshl.edu/fastx_toolkit/
- Connections:
- Input Connection:
- ‘in/first_read’
- ‘in/second_read’
- Output Connection:
- ‘out/first_read_quality_stats’
- ‘out/second_read_quality_stats’
- Options:
- dd-blocksize (str, optional)
- default value: 256k
- new_output_format (bool, optional)
- default value: True
- quality (int, optional)
- default value: 33
Required tools: cat, dd, fastx_quality_stats, mkfifo, pigz
CPU Cores: 4
fix_cutadapt¶
This step takes FASTQ data and removes both reads of a paired-end read, if one of them has been completely removed by cutadapt (or any other software).
- Connections:
- Input Connection:
- ‘in/first_read’
- ‘in/second_read’
- Output Connection:
- ‘out/first_read’
- ‘out/second_read’
- Options:
- dd-blocksize (str, optional)
- default value: 256k
Required tools: cat, dd, fix_cutadapt, mkfifo, pigz
CPU Cores: 4
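The pair filtering described above can be sketched as follows. This is not the code of the fix_cutadapt tool; it assumes four-line FASTQ records, both files listing the reads in the same order, and a read counting as completely removed when its sequence line is empty. The function names are hypothetical.

```python
def fastq_records(lines):
    """Group FASTQ lines into (header, sequence, plus, quality) tuples."""
    stripped = [line.rstrip("\n") for line in lines]
    return [tuple(stripped[i:i + 4]) for i in range(0, len(stripped), 4)]


def drop_incomplete_pairs(first_reads, second_reads):
    """Keep a pair only if both mates still have a non-empty sequence."""
    kept_first, kept_second = [], []
    for read1, read2 in zip(first_reads, second_reads):
        if read1[1] and read2[1]:
            kept_first.append(read1)
            kept_second.append(read2)
    return kept_first, kept_second


first = fastq_records(["@a", "ACGT", "+", "FFFF", "@b", "", "+", ""])
second = fastq_records(["@a", "TTTT", "+", "FFFF", "@b", "GGGG", "+", "FFFF"])
kept_first, kept_second = drop_incomplete_pairs(first, second)
print([record[0] for record in kept_first])  # ['@a']
```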
htseq_count¶
The htseq-count script counts the number of reads overlapping a feature. Input needs to be a file with aligned sequencing reads and a list of genomic features. For more information see:
- Connections:
- Input Connection:
- ‘in/alignments’
- ‘in/features’
- Output Connection:
- ‘out/counts’
- Options:
- a (int, optional)
- dd-blocksize (str, optional)
- default value: 256k
- feature-file (str, optional)
- idattr (str, optional)
- default value: gene_id
- mode (str, optional)
- default value: union
- possible values: ‘union’, ‘intersection-strict’, ‘intersection-nonempty’
- order (str, required)
- possible values: ‘name’, ‘pos’
- stranded (str, required)
- possible values: ‘yes’, ‘no’, ‘reverse’
- type (str, optional)
- default value: exon
Required tools: dd, htseq-count, pigz, samtools
CPU Cores: 2
macs2¶
Model-based Analysis of ChIP-Seq (MACS) is an algorithm for the identification of transcription factor binding sites. MACS captures the influence of genome complexity to evaluate the significance of enriched ChIP regions, and MACS improves the spatial resolution of binding sites through combining the information of both sequencing tag position and orientation. MACS can be easily used for ChIP-Seq data alone, or with control sample data to increase the specificity.
https://github.com/taoliu/MACS
typical command line for single-end data:
macs2 callpeak --treatment <aligned-reads> [--control <aligned-reads>] --name <run-id> --gsize 2.7e9
- Connections:
- Input Connection:
- ‘in/alignments’
- Output Connection:
- ‘out/broadpeaks’
- ‘out/broadpeaks-xls’
- ‘out/diagnosis’
- ‘out/gappedpeaks’
- ‘out/log’
- ‘out/model’
- ‘out/narrowpeaks’
- ‘out/narrowpeaks-xls’
- ‘out/summits’
- Options:
- bdg (bool, optional)
- broad (bool, optional)
- broad-cutoff (float, optional)
- buffer-size (int, optional)
- bw (int, optional)
- call-summits (bool, optional)
- control (dict, required)
- down-sample (bool, optional)
- extsize (int, optional)
- format (str, required)
- default value: AUTO
- possible values: ‘AUTO’, ‘ELAND’, ‘ELANDMULTI’, ‘ELANDMULTIPET’, ‘ELANDEXPORT’, ‘BED’, ‘SAM’, ‘BAM’, ‘BAMPE’, ‘BOWTIE’
- gsize (str, required)
- default value: 2.7e9
- keep-dup (int, optional)
- llocal (str, optional)
- mfold (str, optional)
- nolambda (bool, optional)
- nomodel (bool, optional)
- pvalue (float, optional)
- qvalue (float, optional)
- read-length (int, optional)
- shift (int, optional)
- slocal (str, optional)
- to-large (bool, optional)
- verbose (int, optional)
- possible values: ‘0’, ‘1’, ‘2’, ‘3’
Required tools: macs2, mkdir, mv, pigz
CPU Cores: 4
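The typical command line shown above can be assembled as an argument list, with the control sample left out when none is given. This is a sketch only: the helper name is hypothetical, it covers just the four flags from that command line, and it does not show how the step itself builds the call.

```python
import shlex


def macs2_callpeak_command(treatment, run_id, control=None, gsize="2.7e9"):
    """Build the argument list for a single-end macs2 callpeak call."""
    command = ["macs2", "callpeak", "--treatment", treatment]
    if control is not None:
        # The control sample is optional, as in the typical command line.
        command += ["--control", control]
    command += ["--name", run_id, "--gsize", gsize]
    return command


# Print the command as a shell-quoted string for inspection.
print(shlex.join(macs2_callpeak_command("chip.bam", "sample1", control="input.bam")))
```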
merge_fasta_files¶
This step merges all .fasta(.gz) files belonging to a certain sample. The output files are gzipped.
- Connections:
- Input Connection:
- ‘in/sequence’
- Output Connection:
- ‘out/sequence’
- Options:
- compress-output (bool, optional) – If set to true output is gzipped.
- default value: True
- dd-blocksize (str, optional)
- default value: 256k
- merge-all-runs (bool, optional) – If set to true sequences from all runs are merged
- output-fasta-basename (str, optional) – Name used as prefix for FASTA output.
Required tools: cat, dd, mkfifo, pigz
CPU Cores: 4
merge_fastq_files¶
This step merges all .fastq(.gz) files belonging to a certain sample. First and second read files are merged separately. The output files are gzipped.
- Connections:
- Input Connection:
- ‘in/first_read’
- ‘in/second_read’
- Output Connection:
- ‘out/first_read’
- ‘out/second_read’
- Options:
- dd-blocksize (str, optional)
- default value: 256k
Required tools: cat, dd, mkfifo, pigz
CPU Cores: 4
picard_add_replace_read_groups¶
Replace read groups in a BAM file. This tool enables the user to replace all read groups in the INPUT file with a single new read group and assign all reads to this read group in the OUTPUT BAM file.
Documentation:
https://broadinstitute.github.io/picard/command-line-overview.html#AddOrReplaceReadGroups
- Connections:
- Input Connection:
- ‘in/alignments’
- Output Connection:
- ‘out/alignments’
- Options:
- COMPRESSION_LEVEL (int, optional) – Compression level for all compressed files created (e.g. BAM and GELI). Default value: 5. This option can be set to “null” to clear the default value.
- CREATE_INDEX (bool, optional) – Whether to create a BAM index when writing a coordinate-sorted BAM file. Default value: false. This option can be set to “null” to clear the default value.
- CREATE_MD5_FILE (bool, optional) – Whether to create an MD5 digest for any BAM or FASTQ files created. Default value: false. This option can be set to “null” to clear the default value.
- GA4GH_CLIENT_SECRETS (str, optional) – Google Genomics API client_secrets.json file path. Default value: client_secrets.json. This option can be set to “null” to clear the default value.
- MAX_RECORDS_IN_RAM (int, optional) – When writing SAM files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort a SAM file, and increases the amount of RAM needed. Default value: 500000. This option can be set to “null” to clear the default value.
- QUIET (bool, optional) – Whether to suppress job-summary info on System.err. Default value: false. This option can be set to “null” to clear the default value.
- REFERENCE_SEQUENCE (str, optional) – Reference sequence file. Default value: null.
- RGCN (str, optional) – Read Group sequencing center name. Default value: null.
- RGDS (str, optional) – Read Group description. Default value: null.
- RGDT (str, optional) – Read Group run date. Default value: null.
- RGID (str, optional) – Read Group ID. Default value: 1. This option can be set to ‘null’ to clear the default value.
- RGLB (str, required) – Read Group library
- RGPG (str, optional) – Read Group program group. Default value: null.
- RGPI (int, optional) – Read Group predicted insert size. Default value: null.
- RGPL (str, required) – Read Group platform (e.g. illumina, solid)
- RGPM (str, optional) – Read Group platform model. Default value: null.
- RGPU (str, required) – Read Group platform unit (e.g. run barcode)
- SORT_ORDER (str, optional) – Optional sort order to output in. If not supplied OUTPUT is in the same order as INPUT. Default value: null. Possible values: {unsorted, queryname, coordinate, duplicate}
- possible values: ‘unsorted’, ‘queryname’, ‘coordinate’, ‘duplicate’
- TMP_DIR (str, optional) – A file. Default value: null. This option may be specified 0 or more times.
- VALIDATION_STRINGENCY (str, optional) – Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded. Default value: STRICT. This option can be set to “null” to clear the default value.
- possible values: ‘STRICT’, ‘LENIENT’, ‘SILENT’
- VERBOSITY (str, optional) – Control verbosity of logging. Default value: INFO. This option can be set to “null” to clear the default value.
- possible values: ‘ERROR’, ‘WARNING’, ‘INFO’, ‘DEBUG’
Required tools: picard-tools
CPU Cores: 6
picard_markduplicates¶
Identifies duplicate reads. This tool locates and tags duplicate reads (both PCR and optical/sequencing-driven) in a BAM or SAM file, where duplicate reads are defined as originating from the same original fragment of DNA. Duplicates are identified as read pairs having identical 5’ positions (coordinate and strand) for both reads in a mate pair (and optionally, matching unique molecular identifier reads; see BARCODE_TAG option). Optical, or more broadly Sequencing, duplicates are duplicates that appear clustered together spatially during sequencing and can arise from optical/imaging-processing artifacts or from bio-chemical processes during clonal amplification and sequencing; they are identified using the READ_NAME_REGEX and the OPTICAL_DUPLICATE_PIXEL_DISTANCE options. The tool’s main output is a new SAM or BAM file in which duplicates have been identified in the SAM flags field, or optionally removed (see REMOVE_DUPLICATES and REMOVE_SEQUENCING_DUPLICATES), and optionally marked with a duplicate type in the ‘DT’ optional attribute. In addition, it also outputs a metrics file containing the numbers of READ_PAIRS_EXAMINED, UNMAPPED_READS, UNPAIRED_READS, UNPAIRED_READ_DUPLICATES, READ_PAIR_DUPLICATES, and READ_PAIR_OPTICAL_DUPLICATES.
Usage example:
java -jar picard.jar MarkDuplicates I=input.bam O=marked_duplicates.bam M=marked_dup_metrics.txt
Documentation:
https://broadinstitute.github.io/picard/command-line-overview.html#MarkDuplicates
- Connections:
- Input Connection:
- ‘in/alignments’
- Output Connection:
- ‘out/alignments’
- ‘out/metrics’
- Options:
- ASSUME_SORTED (bool, optional)
- COMMENT (str, optional)
- COMPRESSION_LEVEL (int, optional) – Compression level for all compressed files created (e.g. BAM and GELI). Default value: 5. This option can be set to “null” to clear the default value.
- CREATE_INDEX (bool, optional) – Whether to create a BAM index when writing a coordinate-sorted BAM file. Default value: false. This option can be set to “null” to clear the default value.
- CREATE_MD5_FILE (bool, optional) – Whether to create an MD5 digest for any BAM or FASTQ files created. Default value: false. This option can be set to “null” to clear the default value.
- GA4GH_CLIENT_SECRETS (str, optional) – Google Genomics API client_secrets.json file path. Default value: client_secrets.json. This option can be set to “null” to clear the default value.
- MAX_FILE_HANDLES (int, optional)
- MAX_RECORDS_IN_RAM (int, optional) – When writing SAM files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort a SAM file, and increases the amount of RAM needed. Default value: 500000. This option can be set to “null” to clear the default value.
- OPTICAL_DUPLICATE_PIXEL_DISTANCE (int, optional)
- PROGRAM_GROUP_COMMAND_LINE (str, optional)
- PROGRAM_GROUP_NAME (str, optional)
- PROGRAM_GROUP_VERSION (str, optional)
- PROGRAM_RECORD_ID (str, optional)
- QUIET (bool, optional) – Whether to suppress job-summary info on System.err. Default value: false. This option can be set to “null” to clear the default value.
- READ_NAME_REGEX (str, optional)
- REFERENCE_SEQUENCE (str, optional) – Reference sequence file. Default value: null.
- SORTING_COLLECTION_SIZE_RATIO (float, optional)
- TMP_DIR (str, optional) – A file. Default value: null. This option may be specified 0 or more times.
- VALIDATION_STRINGENCY (str, optional) – Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded. Default value: STRICT. This option can be set to “null” to clear the default value.
- possible values: ‘STRICT’, ‘LENIENT’, ‘SILENT’
- VERBOSITY (str, optional) – Control verbosity of logging. Default value: INFO. This option can be set to “null” to clear the default value.
- possible values: ‘ERROR’, ‘WARNING’, ‘INFO’, ‘DEBUG’
Required tools: picard-tools
CPU Cores: 12
picard_merge_sam_bam_files¶
Documentation:
https://broadinstitute.github.io/picard/command-line-overview.html#MergeSamFiles
- Connections:
- Input Connection:
- ‘in/alignments’
- Output Connection:
- ‘out/alignments’
- Options:
- ASSUME_SORTED (bool, optional) – If true, assume that the input files are in the same sort order as the requested output sort order, even if their headers say otherwise. Default value: false. This option can be set to ‘null’ to clear the default value. Possible values: {true, false}
- COMMENT (str, optional) – Comment(s) to include in the merged output file’s header. Default value: null.
- COMPRESSION_LEVEL (int, optional) – Compression level for all compressed files created (e.g. BAM and GELI). Default value: 5. This option can be set to “null” to clear the default value.
- CREATE_INDEX (bool, optional) – Whether to create a BAM index when writing a coordinate-sorted BAM file. Default value: false. This option can be set to “null” to clear the default value.
- CREATE_MD5_FILE (bool, optional) – Whether to create an MD5 digest for any BAM or FASTQ files created. Default value: false. This option can be set to “null” to clear the default value.
- GA4GH_CLIENT_SECRETS (str, optional) – Google Genomics API client_secrets.json file path. Default value: client_secrets.json. This option can be set to “null” to clear the default value.
- INTERVALS (str, optional) – An interval list file that contains the locations of the positions to merge. Assume bam are sorted and indexed. The resulting file will contain alignments that may overlap with genomic regions outside the requested region. Unmapped reads are discarded. Default value: null.
- MAX_RECORDS_IN_RAM (int, optional) – When writing SAM files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort a SAM file, and increases the amount of RAM needed. Default value: 500000. This option can be set to “null” to clear the default value.
- MERGE_SEQUENCE_DICTIONARIES (bool, optional) – Merge the sequence dictionaries. Default value: false. This option can be set to ‘null’ to clear the default value. Possible values: {true, false}
- QUIET (bool, optional) – Whether to suppress job-summary info on System.err. Default value: false. This option can be set to “null” to clear the default value.
- REFERENCE_SEQUENCE (str, optional) – Reference sequence file. Default value: null.
- SORT_ORDER (str, optional) – Sort order of output file. Default value: coordinate. This option can be set to ‘null’ to clear the default value. Possible values: {unsorted, queryname, coordinate, duplicate}
- possible values: ‘unsorted’, ‘queryname’, ‘coordinate’, ‘duplicate’
- TMP_DIR (str, optional) – A file. Default value: null. This option may be specified 0 or more times.
- USE_THREADING (bool, optional) – Option to create a background thread to encode, compress and write to disk the output file. The threaded version uses about 20% more CPU and decreases runtime by ~20% when writing out a compressed BAM file. Default value: false. This option can be set to ‘null’ to clear the default value. Possible values: {true, false}
- VALIDATION_STRINGENCY (str, optional) – Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded. Default value: STRICT. This option can be set to “null” to clear the default value.
- possible values: ‘STRICT’, ‘LENIENT’, ‘SILENT’
- VERBOSITY (str, optional) – Control verbosity of logging. Default value: INFO. This option can be set to “null” to clear the default value.
- possible values: ‘ERROR’, ‘WARNING’, ‘INFO’, ‘DEBUG’
Required tools: ln, picard-tools
CPU Cores: 12
post_cufflinksSuite¶
The cufflinks suite can be used to assemble new transcripts and merge those with known annotations. However, the output .gtf files need to be reformatted in several aspects afterwards. This step can be used to reformat and filter the cufflinksSuite .gtf file.
- Connections:
- Input Connection:
- ‘in/features’
- Output Connection:
- ‘out/features’
- ‘out/log_stderr’
- Options:
- class_list (str, optional) – Class codes to be removed; possible ‘=,c,j,e,i,o,p,r,u,x,s,.’
- filter_by_class (bool, required) – Remove gtf if any class is found in class_code field, requires class_list
- filter_by_class_and_gene_name (bool, required) – Combines remove-by-class and remove-by-gene-name
- gene_name (str, optional) – String to match in gtf field gene_name for discarding
- default value: ENS
- remove_by_gene_name (bool, required) – Remove gtf if matches ‘string’ in gene_name field
- remove_gencode (bool, required) – Hard removal of gtf lines which match ‘ENS’ in gene_name field
- remove_unstranded (bool, required) – Removes transcripts without strand specificity
- run_id (str, optional) – An arbitrary name of the new run (which is a merge of all samples).
- default value: magic
Required tools: cat, post_cufflinks_merge
CPU Cores: 6
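The effect of filter_by_class together with class_list can be sketched as follows. This is not the code of the post_cufflinks_merge tool; it assumes that class_list is a comma-separated string and that the class code appears in the GTF attribute column as class_code "x";. The function name is hypothetical.

```python
import re

# Matches the class_code attribute in the GTF attribute column.
_CLASS_CODE = re.compile(r'class_code "([^"]*)"')


def filter_by_class(gtf_lines, class_list):
    """Drop GTF lines whose class_code is listed in class_list."""
    removed = set(class_list.split(","))
    kept = []
    for line in gtf_lines:
        match = _CLASS_CODE.search(line)
        if match and match.group(1) in removed:
            continue  # class code is on the removal list
        kept.append(line)
    return kept


lines = [
    'chr1\tCufflinks\ttranscript\t1\t100\t.\t+\t.\tgene_id "G1"; class_code "u";',
    'chr1\tCufflinks\ttranscript\t200\t300\t.\t+\t.\tgene_id "G2"; class_code "=";',
]
print(len(filter_by_class(lines, "u,x")))  # 1
```

Lines without a class_code attribute are kept in this sketch.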
preseq_complexity_curve¶
The preseq package is aimed at predicting the yield of distinct reads from a genomic library from an initial sequencing experiment. The estimates can then be used to examine the utility of further sequencing, optimize the sequencing depth, or to screen multiple libraries to avoid low complexity samples.
c_curve computes the expected yield of distinct reads for experiments smaller than the input experiment in a .bed or .bam file through resampling. The full set of parameters can be printed by simply typing the program name. If output.txt is the desired output file name and input.sort.bed is the input .bed file, then simply type:
preseq c_curve -o output.txt input.sort.bed
- Connections:
- Input Connection:
- ‘in/alignments’
- Output Connection:
- ‘out/complexity_curve’
- Options:
- hist (bool, optional) – input is a text file containing the observed histogram
- pe (bool, required) – input is paired end read file
- seg_len (int, optional) – maximum segment length when merging paired end bam reads (default: 5000)
- step (int, optional) – step size in extrapolations (default: 1e+06)
- vals (bool, optional) – input is a text file containing only the observed counts
Required tools: preseq
CPU Cores: 4
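The idea of a complexity curve obtained through resampling can be illustrated with a small sketch. It is not preseq's implementation: it draws a single random ordering of the reads and counts the distinct reads seen after every step reads. The function name and the representation of reads as hashable identifiers (duplicates share an identifier) are assumptions made for the illustration.

```python
import random


def complexity_curve(read_ids, step, seed=0):
    """Return (reads sampled, distinct reads seen) pairs for one subsampling path."""
    rng = random.Random(seed)
    shuffled = list(read_ids)
    rng.shuffle(shuffled)  # sampling without replacement in random order
    curve = []
    seen = set()
    for count, read_id in enumerate(shuffled, start=1):
        seen.add(read_id)
        # Record a point every `step` reads and at the full experiment size.
        if count % step == 0 or count == len(shuffled):
            curve.append((count, len(seen)))
    return curve


reads = ["a", "a", "b", "c", "c", "c", "d"]
print(complexity_curve(reads, step=2))
```

The last point always equals the total number of reads and the total number of distinct reads in the input, independent of the seed.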
preseq_future_genome_coverage¶
The preseq package is aimed at predicting the yield of distinct reads from a genomic library from an initial sequencing experiment. The estimates can then be used to examine the utility of further sequencing, optimize the sequencing depth, or to screen multiple libraries to avoid low complexity samples.
gc_extrap computes the expected genomic coverage for deeper sequencing for single cell sequencing experiments. The input should be a mr or bed file. The tool bam2mr is provided to convert sorted bam or sam files to mapped read format.
- Connections:
- Input Connection:
- ‘in/alignments’
- Output Connection:
- ‘out/future_genome_coverage’
- Options:
- bin_size (int, optional) – bin size (default: 10)
- bootstraps (int, optional) – number of bootstraps (default: 100)
- cval (float, optional) – level for confidence intervals (default: 0.95)
- extrap (int, optional) – maximum extrapolation in base pairs (default: 1e+12)
- max_width (int, optional) – max fragment length, set equal to read length for single end reads
- quick (bool, optional) – quick mode: run gc_extrap without bootstrapping for confidence intervals
- step (int, optional) – step size in bases between extrapolations (default: 1e+08)
- terms (int, optional) – maximum number of terms
Required tools: preseq
CPU Cores: 4
preseq_future_yield¶
The preseq package is aimed at predicting the yield of distinct reads from a genomic library from an initial sequencing experiment. The estimates can then be used to examine the utility of further sequencing, optimize the sequencing depth, or to screen multiple libraries to avoid low complexity samples.
lc_extrap computes the expected future yield of distinct reads and bounds on the number of total distinct reads in the library and the associated confidence intervals.
- Connections:
- Input Connection:
- ‘in/alignments’
- Output Connection:
- ‘out/future_yield’
- Input Connection:
- Options:
- bootstraps (int, optional) – number of bootstraps (default: 100)
- cval (float, optional) – level for confidence intervals (default: 0.95)
- dupl_level (float, optional) – fraction of duplicate to predict (default: 0.5)
- extrap (int, optional) – maximum extrapolation (default: 1e+10)
- hist (bool, optional) – input is a text file containing the observed histogram
- pe (bool, required) – input is paired end read file
- quick (bool, optional) – quick mode, estimate yield without bootstrapping for confidence intervals
- seg_len (int, optional) – maximum segment length when merging paired end bam reads (default: 5000)
- step (int, optional) – step size in extrapolations (default: 1e+06)
- terms (int, optional) – maximum number of terms
- vals (bool, optional) – input is a text file containing only the observed counts
Required tools: preseq
CPU Cores: 4
remove_duplicate_reads_runs¶
Duplicates are removed by Picard tools ‘MarkDuplicates’.
typical command line:
MarkDuplicates INPUT=<SAM/BAM> OUTPUT=<SAM/BAM> METRICS_FILE=<metrics-out> REMOVE_DUPLICATES=true
- Connections:
- Input Connection:
- ‘in/alignments’
- Output Connection:
- ‘out/alignments’
- ‘out/metrics’
- Input Connection:
Required tools: MarkDuplicates
CPU Cores: 12
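As a minimal sketch, the typical command line above can be assembled as an argument list in Python. The function name and the file names are hypothetical and chosen only for illustration; the step itself may build its command differently.

```python
def mark_duplicates_args(input_bam, output_bam, metrics_file):
    """Return the argument list for the typical MarkDuplicates call."""
    return [
        "MarkDuplicates",
        f"INPUT={input_bam}",
        f"OUTPUT={output_bam}",
        f"METRICS_FILE={metrics_file}",
        "REMOVE_DUPLICATES=true",  # duplicates are removed, not just flagged
    ]


# Hypothetical file names, used only to show the resulting command line.
args = mark_duplicates_args("sample.bam", "sample.dedup.bam", "sample.metrics.txt")
print(" ".join(args))
```

Keeping the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.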
rgt_thor¶
THOR is an HMM-based approach to detect and analyze differential peaks in two sets of ChIP-seq data from distinct biological conditions with replicates. THOR performs genomic signal processing, peak calling and p-value calculation in an integrated framework. For differential peak calling without replicates use ODIN.
For more information, please refer to:
Allhoff, M., Sere K., Freitas, J., Zenke, M., Costa, I.G. (2016), Differential Peak Calling of ChIP-seq Signals with Replicates with THOR, Nucleic Acids Research, epub gkw680.
Feel free to post your question in our googleGroup or write an e-mail: rgtusers@googlegroups.com
- Connections:
- Input Connection:
- ‘in/alignments’
- Output Connection:
- ‘out/chip_seq_bigwig’
- ‘out/diff_narrow_peaks’
- ‘out/diff_peaks_thor_bed’
- ‘out/thor_config’
- ‘out/thor_setup_info’
- Input Connection:
- Options:
- binsize (int, optional) – Size of bins for creating the signal.
- chrom_sizes_file (str, required)
- config_file (dict, required) – A dictionary with
- deadzones (str, optional) – Define blacklisted genomic regions to be ignored by the peak caller.
- exts (str, optional) – Read’s extension size for BAM files (comma separated list for each BAM file in config file). If option is not chosen, estimate extension sizes from reads.
- factors-inputs (str, optional) – Normalization factors for input-DNA (comma separated list for each BAM file in config file). If option is not chosen, estimate factors.
- genome (str, required) – FASTA file containing the complete genome sequence
- housekeeping-genes (str, optional) – Define housekeeping genes (BED format) used for normalizing.
- merge (bool, optional) – Merge peaks which have a distance less than the estimated mean fragment size (recommended for histone data).
- name (str, optional) – Experiment’s name and prefix for all files that are created.
- no-correction (bool, optional) – Do not use multiple test correction for p-values (Benjamini/Hochberg).
- no-gc-content (bool, optional) – Do not normalize towards GC content.
- pvalue (float, optional) – P-value cutoff for peak detection. Call only peaks with p-value lower than cutoff.
- report (bool, optional) – Generate HTML report about experiment.
- save-input (bool, optional) – Save input DNA bigwig (if input was provided).
- scaling-factors (str, optional) – Scaling factor for each BAM file (not control input-DNA) as comma separated list for each BAM file in config file. If option is not chosen, follow normalization strategy (TMM or HK approach)
- step (int, optional) – Stepsize with which the window consecutively slides across the genome to create the signal.
Required tools: printf, rgt-THOR
CPU Cores: 4
rseqc¶
The RSeQC step can be used to evaluate aligned reads in a BAM file. RSeQC does not only report raw sequence-based metrics, but also quality control metrics like read distribution, gene coverage, and sequencing depth.
- Connections:
- Input Connection:
- ‘in/alignments’
- Output Connection:
- ‘out/bam_stat’
- ‘out/infer_experiment’
- ‘out/read_distribution’
- Input Connection:
- Options:
- reference (str, required) – Reference gene model in BED format.
Required tools: bam_stat.py, cat, infer_experiment.py, read_distribution.py
CPU Cores: 1
s2c¶
s2c formats the output of segemehl mapping to be compatible with the cufflinks suite of tools for differential expression analysis of RNA-Seq data and their visualisation. For details on cufflinks we refer to the author’s webpage:
- Connections:
- Input Connection:
- ‘in/alignments’
- Output Connection:
- ‘out/alignments’
- ‘out/log’
- Input Connection:
- Options:
- tmp_dir (str, required) – Temp directory for ‘s2c.py’. This can be in the /work/username/ path, since it is only temporary.
Required tools: cat, dd, fix_s2c, pigz, s2c, samtools
CPU Cores: 6
sam_to_sorted_bam¶
The step sam_to_sorted_bam builds on ‘samtools sort’ to sort SAM files and output BAM files.
Sort alignments by leftmost coordinates, or by read name when -n is used. An appropriate @HD-SO sort order header tag will be added or an existing one updated if necessary.
Documentation:
http://www.htslib.org/doc/samtools.html
- Connections:
- Input Connection:
- ‘in/alignments’
- Output Connection:
- ‘out/alignments’
- Input Connection:
- Options:
- dd-blocksize (str, optional) - default value: 256k
- genome-faidx (str, required)
- sort-by-name (bool, required)
- temp-sort-dir (str, required) – Intermediate sort files are stored in this directory.
Required tools: dd, pigz, samtools
CPU Cores: 8
samtools_faidx¶
Index reference sequence in the FASTA format or extract subsequence from indexed reference sequence. If no region is specified, faidx will index the file and create <ref.fasta>.fai on the disk. If regions are specified, the subsequences will be retrieved and printed to stdout in the FASTA format.
- Connections:
- Input Connection:
- ‘in/sequence’
- Output Connection:
- ‘out/indices’
- Input Connection:
Required tools: mv, samtools
CPU Cores: 4
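To illustrate what the created <ref.fasta>.fai file contains, the following Python sketch computes the five columns of the faidx index format (sequence name, length, byte offset of the first base, bases per line, bytes per line) for an in-memory FASTA. The function name `build_fai` and the toy sequences are assumptions made for this example; the step itself simply calls samtools.

```python
def build_fai(fasta_bytes):
    """Compute .fai-style records: (name, length, offset, linebases, linewidth)."""
    records = []
    current = None
    pos = 0  # byte position of the start of the current line
    for line in fasta_bytes.splitlines(keepends=True):
        if line.startswith(b">"):
            if current is not None:
                records.append(tuple(current))
            name = line[1:].split()[0].decode()
            # The offset points at the first base after the header line.
            current = [name, 0, pos + len(line), 0, 0]
        elif current is not None:
            bases = len(line.rstrip(b"\r\n"))
            current[1] += bases
            if current[3] == 0:
                current[3] = bases      # bases on the first sequence line
                current[4] = len(line)  # bytes per line including the newline
        pos += len(line)
    if current is not None:
        records.append(tuple(current))
    return records


# Toy FASTA, used only for illustration.
fasta = b">chr1 test\nACGTACGT\nACGT\n>chr2\nGG\n"
fai = build_fai(fasta)
for record in fai:
    print("\t".join(str(field) for field in record))
```

The sketch assumes well-formed input with non-empty sequence names and uniform line lengths, which is also what faidx expects.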
samtools_index¶
Index a coordinate-sorted BAM or CRAM file for fast random access. (Note that this does not work with SAM files even if they are bgzip compressed; to index such files, use tabix(1) instead.)
Documentation:
http://www.htslib.org/doc/samtools.html
- Connections:
- Input Connection:
- ‘in/alignments’
- Output Connection:
- ‘out/alignments’
- ‘out/index_stats’
- ‘out/indices’
- Input Connection:
- Options:
- index_type (str, required) - possible values: ‘bai’, ‘csi’
Required tools: ln, samtools
CPU Cores: 4
samtools_stats¶
samtools stats collects statistics from BAM files and outputs in a text format. The output can be visualized graphically using plot-bamstats.
Documentation:
http://www.htslib.org/doc/samtools.html
- Connections:
- Input Connection:
- ‘in/alignments’
- Output Connection:
- ‘out/stats’
- Input Connection:
- Options:
- dd-blocksize (str, optional) - default value: 256k
Required tools: dd, pigz, samtools
CPU Cores: 1
segemehl¶
segemehl is a software to map short sequencer reads to reference genomes. Unlike other methods, segemehl is able to detect not only mismatches but also insertions and deletions. Furthermore, segemehl is not limited to a specific read length and is able to map primer- or polyadenylation-contaminated reads correctly.
This step first creates two FIFOs. The first provides the genome data to segemehl and the second receives the unmapped reads:
mkfifo genome_fifo unmapped_fifo
cat <genome-fasta> -o genome_fifo
The executed segemehl command is this:
segemehl -d genome_fifo -i <genome-index-file> -q <read1-fastq> [-p <read2-fastq>] -u unmapped_fifo -H 1 -t 11 -s -S -D 0 -o /dev/stdout | pigz --blocksize 4096 --processes 2 -c
The unmapped reads are saved via this command:
cat unmapped_fifo | pigz --blocksize 4096 --processes 2 -c > <unmapped-fastq>
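The FIFO hand-off for the unmapped reads can be sketched in Python as follows. This is only an illustration under stated assumptions: the standard-library gzip module stands in for pigz, a small in-memory FASTQ record stands in for segemehl's unmapped output, and the helper names are hypothetical. It requires a POSIX system because it uses os.mkfifo.

```python
import gzip
import os
import tempfile
import threading


def feed_fifo(data, fifo_path):
    # Opening a FIFO for writing blocks until a reader opens the other end.
    with open(fifo_path, "wb") as fifo:
        fifo.write(data)


def drain_fifo_to_gzip(fifo_path, out_path):
    # Stand-in for `cat unmapped_fifo | pigz -c > <unmapped-fastq>`.
    with open(fifo_path, "rb") as fifo, gzip.open(out_path, "wb") as out:
        for chunk in iter(lambda: fifo.read(65536), b""):
            out.write(chunk)


workdir = tempfile.mkdtemp()
fifo_path = os.path.join(workdir, "unmapped_fifo")
out_path = os.path.join(workdir, "unmapped.fastq.gz")
os.mkfifo(fifo_path)  # POSIX only

# Hypothetical read standing in for what segemehl writes via -u.
reads = b"@read1\nACGT\n+\nIIII\n"
writer = threading.Thread(target=feed_fifo, args=(reads, fifo_path))
writer.start()
drain_fifo_to_gzip(fifo_path, out_path)
writer.join()

with gzip.open(out_path, "rb") as handle:
    roundtrip = handle.read()
print(roundtrip == reads)
```

Because neither end of a FIFO can proceed until the other is open, producer and consumer must run concurrently, which is why the writer runs in its own thread here.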
- Connections:
- Input Connection:
- ‘in/first_read’
- ‘in/second_read’
- Output Connection:
- ‘out/alignments’
- ‘out/log’
- ‘out/unmapped’
- Input Connection:
- Options:
- MEOP (bool, optional) – output MEOP field for easier variance calling in SAM (XE:Z:)
- SEGEMEHL (bool, optional) – output SEGEMEHL format (needs to be selected for brief)
- accuracy (int, optional) – min percentage of matches per read in semi-global alignment (default:90)
- autoclip (bool, optional) – autoclip unknown 3prime adapter
- bisulfite (int, optional) – bisulfite mapping with methylC-seq/Lister et al. (=1) or bs-seq/Cokus et al. protocol (=2) (default:0)
- possible values: ‘0’, ‘1’, ‘2’
- brief (bool, optional) – brief output
- clipacc (int, optional) – clipping accuracy (default:70)
- dd-blocksize (str, optional) - default value: 256k
- differences (int, optional) – search seeds initially with <n> differences (default:1)
- default value: 1
- dropoff (int, optional) – dropoff parameter for extension (default:8)
- evalue (float, optional) – max evalue (default:5.000000)
- extensionpenalty (int, optional) – penalty for a mismatch during extension (default:4)
- extensionscore (int, optional) – score of a match during extension (default:2)
- fix-qnames (bool, optional) – The QNAMES field of the input will be purged from spaces and everything thereafter.
- genome (str, required) – Path to genome file
- hardclip (bool, optional) – enable hard clipping
- hitstrategy (int, optional) – report only best scoring hits (=1) or all (=0) (default:1)
- default value: 1
- possible values: ‘0’, ‘1’
- index (str, required) – Path to genome index for segemehl
- jump (int, optional) – search seeds with jump size <n> (0=automatic) (default:0)
- maxinsertsize (int, optional) – maximum size of the inserts (paired end) (default:5000)
- maxinterval (int, optional) – maximum width of a suffix array interval, i.e. a query seed will be omitted if it matches more than <n> times (default:100)
- maxsplitevalue (float, optional) – max evalue for splits (default:50.000000)
- minfraglen (int, optional) – min length of a spliced fragment (default:20)
- minfragscore (int, optional) – min score of a spliced fragment (default:18)
- minsize (int, optional) – minimum size of queries (default:12)
- minsplicecover (int, optional) – min coverage for spliced transcripts (default:80)
- nohead (bool, optional) – do not output header
- order (bool, optional) – sorts the output by chromosome and position (might take a while!)
- polyA (bool, optional) – clip polyA tail
- prime3 (str, optional) – add 3’ adapter (default:none)
- prime5 (str, optional) – add 5’ adapter (default:none)
- showalign (bool, optional) – show alignments
- silent (bool, optional) – shut up!
- default value: True
- splicescorescale (float, optional) – report spliced alignment with score s only if <f>*s is larger than next best spliced alignment (default:1.000000)
- splits (bool, optional) – detect split/spliced reads (default:none)
- default value: True
- threads (int, optional) – start <n> threads (default:10)
- default value: 10
Required tools: cat, dd, fix_qnames, mkfifo, pigz, segemehl
CPU Cores: 10
segemehl_generate_index¶
The step segemehl_generate_index generates an index for given reference sequences.
Documentation:
http://www.bioinf.uni-leipzig.de/Software/segemehl/
- Connections:
- Input Connection:
- ‘in/reference_sequence’
- Output Connection:
- ‘out/log’
- ‘out/segemehl_index’
- Input Connection:
- Options:
- dd-blocksize (str, optional) - default value: 256k
- index-basename (str, required) – Basename for created segemehl index.
Required tools: dd, mkfifo, pigz, segemehl
CPU Cores: 4
sra_fastq_dump¶
The SRA tools are a suite from NCBI for handling SRA (Sequence Read Archive, formerly Short Read Archive) files. fastq-dump is an SRA tool that dumps the content of an SRA file in FASTQ format.
The following options cannot be set, as they would interfere with the pipeline implemented in this step:
- -O|--outdir <path> – Output directory, default is working directory ‘.’
- -Z|--stdout – Output to stdout, all split data become joined into single stream
- --gzip – Compress output using gzip
- --bzip2 – Compress output using bzip2
Multiple File Options: setting these options will produce more than 1 file, each of which will be suffixed according to splitting criteria.
- --split-files – Dump each read into separate file. Files will receive suffix corresponding to read number
- --split-3 – Legacy 3-file splitting for mate-pairs: First biological reads satisfying dumping conditions are placed in files *_1.fastq and *_2.fastq. If only one biological read is present it is placed in *.fastq. Biological reads and above are ignored.
- -G|--spot-group – Split into files by SPOT_GROUP (member name)
- -R|--read-filter <[filter]> – Split into files by READ_FILTER value, optionally filter by value: pass|reject|criteria|redacted
- -T|--group-in-dirs – Split into subdirectories instead of files
- -K|--keep-empty-files – Do not delete empty files
Details on fastq-dump can be found at https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc&f=fastq-dump
To make IO cluster friendly, fastq-dump does not read the sra file directly. Rather, dd with a configurable blocksize is used to provide the sra file via a fifo to fastq-dump.
The executed calls look like this:
mkfifo sra_fifo
dd bs=4M if=<sra-file> of=sra_fifo
fastq-dump -Z sra_fifo | pigz --blocksize 4096 --processes 2 > file.fastq
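The effect of the dd block size can be sketched in Python: a size string such as the dd-blocksize default ‘256k’ is converted to bytes and the input is then copied in blocks of that size. The helper names are hypothetical, and the suffix handling assumes binary multiples (k = 1024), which is how dd is commonly understood to treat these suffixes; only the k, M and G suffixes are covered.

```python
import io

# Assumed binary multiples for dd-style size suffixes.
_SUFFIXES = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_blocksize(text):
    """Convert a dd-style size such as '256k' or '4M' into bytes."""
    text = text.strip()
    factor = _SUFFIXES.get(text[-1].lower())
    if factor is None:
        return int(text)  # plain number of bytes
    return int(text[:-1]) * factor


def copy_in_blocks(src, dst, blocksize):
    """Copy src to dst in fixed-size blocks; return the number of blocks read."""
    blocks = 0
    while True:
        chunk = src.read(blocksize)
        if not chunk:
            return blocks
        dst.write(chunk)
        blocks += 1


bs = parse_blocksize("256k")
# In-memory stand-ins for the sra file and the fifo.
src = io.BytesIO(b"x" * (bs * 2 + 10))
dst = io.BytesIO()
n_blocks = copy_in_blocks(src, dst, bs)
print(bs, n_blocks)
```

Larger blocks mean fewer, larger read requests against the shared file system, which is the motivation for routing the input through dd.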
- Connections:
- Input Connection:
- ‘in/sequence’
- Output Connection:
- ‘out/first_read’
- ‘out/log’
- ‘out/second_read’
- Input Connection:
- Options:
- accession (str, optional) – Replaces accession derived from <path> in filename(s) and deflines (only for single table dump)
- aligned (bool, optional) – Dump only aligned sequences
- aligned-region (str, optional) – Filter by position on genome. Name can either be accession.version (ex:NC_000001.10) or file specific name (ex:”chr1” or “1”). “from” and “to” are 1-based coordinates. <name[:from-to]>
- clip (bool, optional) – Apply left and right clips
- dd-blocksize (str, optional) - default value: 256k
- defline-qual (str, optional) – Defline format specification for quality.
- defline-seq (str, optional) – Defline format specification for sequence.
- disable-multithreading (bool, optional) – disable multithreading
- dumpbase (bool, optional) – Formats sequence using base space (default for other than SOLiD).
- dumpcs (bool, optional) – Formats sequence using color space (default for SOLiD), “cskey” may be specified for translation.
- fasta (int, optional) – FASTA only, no qualities, optional line wrap width (set to zero for no wrapping). <[line width]>
- helicos (bool, optional) – Helicos style defline
- legacy-report (bool, optional) – use legacy style “Written spots” for tool
- log-level (str, optional) – Logging level as number or enum string One of (fatal|sys|int|err|warn|info) or (0-5). Current/default is warn. <level>
- matepair-distance (str, optional) – Filter by distance between matepairs. Use “unknown” to find matepairs split between the references. Use from-to to limit matepair distance on the same reference. <from-to|unknown>
- maxSpotId (int, optional) – Maximum spot id to be dumped. Use with “minSpotId” to dump a range.
- max_cores (int, optional) – Maximum number of cores available on the cluster
- default value: 10
- minReadLen (int, optional) – Filter by sequence length >= <len>
- minSpotId (int, optional) – Minimum spot id to be dumped. Use with “maxSpotId” to dump a range.
- ncbi_error_report (str, optional) – Control program execution environment report generation (if implemented). One of (never|error|always). Default is error. <error>
- offset (int, optional) – Offset to use for quality conversion, default is 33
- origfmt (bool, optional) – Defline contains only original sequence name
- qual-filter (bool, optional) – Filter used in early 1000 Genomes data: no sequences starting or ending with >= 10N
- qual-filter-1 (bool, optional) – Filter used in current 1000 Genomes data
- read-filter (str, optional) – Split into files by READ_FILTER value optionally filter by value: pass|reject|criteria|redacted
- readids (bool, optional) – Append read id after spot id as “accession.spot.readid” on defline.
- skip-technical (bool, optional) – Dump only biological reads
- split-spot (str, optional) – Split spots into individual reads
- spot-groups (str, optional) – Filter by SPOT_GROUP (member): name[,...]
- suppress-qual-for-cskey (bool, optional) – suppress quality-value for cskey
- table (str, optional) – Table name within cSRA object, default is “SEQUENCE”
- unaligned (bool, optional) – Dump only unaligned sequences
- verbose (bool, optional) – Increase the verbosity level of the program. Use multiple times for more verbosity.
Required tools: dd, fastq-dump, mkfifo, pigz
CPU Cores: 10
subsetMappedReads¶
subsetMappedReads selects a provided number of mapped reads from a file in .sam or .bam format. Depending on the set options the first N mapped reads and their mates (for paired end sequencing) are returned in .sam format. If the number of requested reads exceeds the number of available mapped reads, all mapped reads are returned.
- Connections:
- Input Connection:
- ‘in/alignments’
- Output Connection:
- ‘out/alignments’
- ‘out/log’
- Input Connection:
- Options:
- Nreads (str, required) – Number of reads to extract from input file.
- genome-faidx (str, required)
- paired_end (bool, required) – The reads are expected to have a mate, due to paired end sequencing.
Required tools: cat, dd, head, pigz, samtools
CPU Cores: 1
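The selection logic can be sketched in Python for the single-end case. The sketch passes header lines through and keeps the first N records whose FLAG field does not have the 0x4 (unmapped) bit set; handling of mates for paired-end data is left out. The function name and the toy SAM records are assumptions for illustration, and the step itself relies on samtools and head rather than on code like this.

```python
def first_mapped_reads(sam_lines, n_reads):
    """Yield header lines plus the first n_reads mapped alignment records."""
    kept = 0
    for line in sam_lines:
        if line.startswith("@"):
            yield line  # header lines are passed through unchanged
            continue
        if kept >= n_reads:
            break
        flag = int(line.split("\t")[1])
        if flag & 0x4:  # bit 0x4 set: the read is unmapped
            continue
        kept += 1
        yield line


# Toy SAM content, used only for illustration.
sam = [
    "@HD\tVN:1.0\tSO:unsorted",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r3\t16\tchr1\t200\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r4\t0\tchr1\t300\t60\t4M\t*\t0\t0\tACGT\tIIII",
]
subset_names = [line.split("\t")[0] for line in first_mapped_reads(sam, 2)]
all_names = [line.split("\t")[0] for line in first_mapped_reads(sam, 10)]
print(subset_names)
```

Requesting more reads than are available returns every mapped read, matching the behaviour described above.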
tophat2¶
TopHat is a fast splice junction mapper for RNA-Seq reads. It aligns RNA-Seq reads to mammalian-sized genomes using the ultra high-throughput short read aligner Bowtie, and then analyzes the mapping results to identify splice junctions between exons.
typical command line:
tophat [options]* <index_base> <reads1_1[,...,readsN_1]> [reads1_2,...readsN_2]
- Connections:
- Input Connection:
- ‘in/first_read’
- ‘in/second_read’
- Output Connection:
- ‘out/align_summary’
- ‘out/alignments’
- ‘out/deletions’
- ‘out/insertions’
- ‘out/junctions’
- ‘out/log_stderr’
- ‘out/misc_logs’
- ‘out/prep_reads’
- ‘out/unmapped’
- Input Connection:
- Options:
- index (str, required) – Path to genome index for tophat2
- library_type (str, required) – The default is unstranded (fr-unstranded). If either fr-firststrand or fr-secondstrand is specified, every read alignment will have an XS attribute tag as explained below. Consider supplying library type options below to select the correct RNA-seq protocol. (https://ccb.jhu.edu/software/tophat/manual.shtml)
- possible values: ‘fr-unstranded’, ‘fr-firststrand’, ‘fr-secondstrand’
Required tools: mkdir, mv, tar, tophat2
CPU Cores: 6
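As a rough sketch, the typical command line and the library_type option can be combined into an argument list as shown below. The function name, index path and read file names are hypothetical; the executable name tophat2 is taken from the required tools, and --library-type is the TopHat option that corresponds to the listed values. The step's real command may carry additional options.

```python
LIBRARY_TYPES = ("fr-unstranded", "fr-firststrand", "fr-secondstrand")


def tophat2_args(index_base, first_reads, second_reads=None,
                 library_type="fr-unstranded"):
    """Build a tophat2 argument list for single-end or paired-end input."""
    if library_type not in LIBRARY_TYPES:
        raise ValueError(f"unsupported library type: {library_type}")
    args = ["tophat2", "--library-type", library_type, index_base]
    # Multiple read files per mate are passed as one comma-separated argument.
    args.append(",".join(first_reads))
    if second_reads:
        args.append(",".join(second_reads))
    return args


# Hypothetical index and read files, used only for illustration.
paired = tophat2_args(
    "genome/index",
    ["lane1_R1.fastq", "lane2_R1.fastq"],
    ["lane1_R2.fastq", "lane2_R2.fastq"],
    library_type="fr-firststrand",
)
print(" ".join(paired))
```

Validating library_type before building the command surfaces a misspelled value early instead of at tool run time.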