Quick Start uap

First, you need to install uap (see Installation of uap). Once the installation has finished, example analyses can be found in the folder example-configurations.

Let’s jump head first into uap and have a look at some examples:

$ cd <uap-path>/example-configurations/
$ ls *.yaml
2007-CD4+_T_Cell_ChIPseq-Barski_et_al_download.yaml
2007-CD4+_T_Cell_ChIPseq-Barski_et_al.yaml
2014-RNA_CaptureSeq-Mercer_et_al_download.yaml
2014-RNA_CaptureSeq-Mercer_et_al.yaml
download_human_gencode_release.yaml
index_homo_sapiens_hg19_genome.yaml
index_mycoplasma_genitalium_ASM2732v1_genome.yaml

These example configurations differ in their usage of computational resources. Some example configurations download or work on small datasets and are thus feasible for machines with limited resources. Most examples can be extended by uncommenting additional steps. This might change their computational requirements in such a way that a very powerful stand-alone machine or a cluster system is required. The examples are marked accordingly in the sections below.
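Since the examples below state their approximate disk usage, it can help to compare that figure with the free space on your machine before starting. The following is a minimal sketch and not part of uap; the function name has_free_mb is made up for illustration, and it assumes that the file system name reported by df contains no spaces.

```shell
#!/bin/sh
# Hypothetical helper (not part of uap): succeed if the file system that
# holds directory $1 has at least $2 megabytes available.
has_free_mb() {
    # 'df -Pk' prints POSIX-format output in 1024-byte blocks;
    # field 4 of the second line is the available space.
    avail_kb=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $(( $2 * 1024 )) ]
}
```

For instance, has_free_mb . 240 could be called before running an example that lists a disk usage of about 240 MB; the helper only reports the result through its exit status.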

Note

Before computing an example on a cluster, you need to uncomment the cluster Section and adapt the settings as required. Please also check whether the Cluster Configuration File fits your cluster system.

Note

The examples state where users can obtain the required external bioinformatics tools. If uap fails due to a missing tool, please check the provided URLs for installation instructions.
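One way to spot missing tools before starting a workflow is to test whether they can be found on your PATH. The following is a minimal sketch and not part of uap; the function name missing_tools is made up for illustration, and the executable names you pass to it are your own assumption, since a configuration may refer to its tools by other names or paths.

```shell
#!/bin/sh
# Hypothetical helper (not part of uap): print every given tool name
# that cannot be found on the current PATH, one per line.
missing_tools() {
    for tool in "$@"; do
        # 'command -v' is the POSIX way to test whether a command exists.
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}
```

Calling missing_tools with the names of the tools an example needs prints only those that are not installed; no output means every name was found.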

Handle Genomic Data

A usual analysis of High-Throughput Sequencing (HTS) data relies on different publicly available data. Most important is probably the genomic sequence of the species under investigation. That sequence is required to construct the indices (data structures used by read aligners). Other publicly available data sets (such as reference annotations or the chromosome sizes) might also be required for an analysis. The following configurations showcase how to get or generate that data:

index_mycoplasma_genitalium_ASM2732v1_genome.yaml

Downloads the Mycoplasma genitalium genome and generates the indices for bowtie2, bwa, segemehl, and samtools. This workflow is quite fast because it uses the very small genome of Mycoplasma genitalium.

Max. memory: ~0.5 GB
Disk usage: ~20 MB
Run time: minutes

Required tools:

index_homo_sapiens_hg19_genome.yaml

Downloads chromosome 21 of the Homo sapiens genome and generates the indices for bowtie2, bwa, and samtools. This minimal version should run fine on machines with limited resources. Users can uncomment steps to download the complete genome, which would substantially increase the required computational resources. The segemehl index creation is commented out because of its high memory consumption (~50-60 GB) when working with the whole genome.

Max. memory: ~2 GB
Disk usage: ~240 MB
Run time: several minutes

If you uncomment the steps for the complete genome (and for the segemehl index), this workflow requires substantial computational resources due to the size of the human genome. Please make sure to run it only on well-equipped machines.

Required tools:

download_human_gencode_release.yaml

Downloads the human Gencode main annotation v19 and a subset for long non-coding RNA genes. This workflow only downloads files from the internet and thus should work on any machine.

Max. memory: depends on your machine
Disk usage: ~1.2 GB
Run time: depends on your internet connection

Required tools:

Let’s have a look at the Mycoplasma genitalium example workflow by checking its status:

$ cd <uap-path>/example-configurations/
$ uap index_mycoplasma_genitalium_ASM2732v1_genome.yaml status
[uap] Set log level to ERROR
[uap][ERROR]: index_mycoplasma_genitalium_ASM2732v1_genome.yaml: Destination path does not exist: genomes/bacteria/Mycoplasma_genitalium/

Oops, the destination_path does not exist (see destination_path Section). Create it and start again:

$ mkdir -p genomes/bacteria/Mycoplasma_genitalium/
$ uap index_mycoplasma_genitalium_ASM2732v1_genome.yaml status

Waiting tasks
-------------
[w] bowtie2_index/Mycoplasma_genitalium_index-download
[w] bwa_index/Mycoplasma_genitalium_index-download
[w] fasta_index/download
[w] segemehl_index/Mycoplasma_genitalium_genome-download

Ready tasks
-----------
[r] M_genitalium_genome/download

tasks: 5 total, 4 waiting, 1 ready

A list with all runs and their respective state should be displayed. A run is always in one of these states:

  • [r]eady
  • [w]aiting
  • [q]ueued
  • [e]xecuting
  • [f]inished
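When a workflow has many runs, a count per state can be easier to read than the full list. The sketch below is an illustration and not a uap feature: it assumes that task lines start with a bracketed state letter exactly as in the status output above, and the function name count_states is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not part of uap): read status output on stdin and
# print how many task lines carry each bracketed state letter.
# Assumes task lines start with "[x] ", as in the status output shown above.
count_states() {
    awk '/^\[[a-z]\] / { n[substr($0, 2, 1)]++ }
         END { for (s in n) print s, n[s] }' | sort
}

# Sample input copied from the status output above.
sample='Waiting tasks
-------------
[w] bowtie2_index/Mycoplasma_genitalium_index-download
[w] bwa_index/Mycoplasma_genitalium_index-download
[w] fasta_index/download
[w] segemehl_index/Mycoplasma_genitalium_genome-download

Ready tasks
-----------
[r] M_genitalium_genome/download

tasks: 5 total, 4 waiting, 1 ready'

# Prints "r 1" and "w 4" on separate lines.
printf '%s\n' "$sample" | count_states
```

In practice the function would read from the status command directly, for example by piping uap index_mycoplasma_genitalium_ASM2732v1_genome.yaml status into count_states.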

If the command still fails, please check that the tools defined in index_mycoplasma_genitalium_ASM2732v1_genome.yaml are available in your environment (see tools Section). If you really want to download and index the genome, tell uap to start the workflow:

$ uap index_mycoplasma_genitalium_ASM2732v1_genome.yaml run-locally

uap should have created a symbolic link named index_mycoplasma_genitalium_ASM2732v1_genome.yaml-out pointing to the destination_path. Its content should look something like this:

$ tree --charset=ascii
.
|-- bowtie2_index
|   |-- Mycoplasma_genitalium_index-download-cMQPtBxs
|   |   |-- Mycoplasma_genitalium_index-download.1.bt2
|   |   |-- Mycoplasma_genitalium_index-download.2.bt2
|   |   |-- Mycoplasma_genitalium_index-download.3.bt2
|   |   |-- Mycoplasma_genitalium_index-download.4.bt2
|   |   |-- Mycoplasma_genitalium_index-download.rev.1.bt2
|   |   `-- Mycoplasma_genitalium_index-download.rev.2.bt2
|   `-- Mycoplasma_genitalium_index-download-ZsvbSjtK
|       |-- Mycoplasma_genitalium_index-download.1.bt2
|       |-- Mycoplasma_genitalium_index-download.2.bt2
|       |-- Mycoplasma_genitalium_index-download.3.bt2
|       |-- Mycoplasma_genitalium_index-download.4.bt2
|       |-- Mycoplasma_genitalium_index-download.rev.1.bt2
|       `-- Mycoplasma_genitalium_index-download.rev.2.bt2
|-- bwa_index
|   `-- Mycoplasma_genitalium_index-download-XRyj5AnJ
|       |-- Mycoplasma_genitalium_index-download.amb
|       |-- Mycoplasma_genitalium_index-download.ann
|       |-- Mycoplasma_genitalium_index-download.bwt
|       |-- Mycoplasma_genitalium_index-download.pac
|       `-- Mycoplasma_genitalium_index-download.sa
|-- fasta_index
|   `-- download-HA439DGO
|       `-- Mycoplasma_genitalium.ASM2732v1.fa.fai
|-- M_genitalium_genome
|   `-- download-5dych7Xj
|-- Mycoplasma_genitalium.ASM2732v1.fa
|-- segemehl_index
|   |-- Mycoplasma_genitalium_genome-download-2UKxxupJ
|   |   |-- download-segemehl-generate-index-log.txt
|   |   `-- Mycoplasma_genitalium_genome-download.idx
|   `-- Mycoplasma_genitalium_genome-download-zgtEpQmV
|       |-- download-segemehl-generate-index-log.txt
|       `-- Mycoplasma_genitalium_genome-download.idx
`-- temp
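To confirm that an index directory is complete, you can test for the files you expect in it. The sketch below checks for the six bowtie2 files that appear in the listing above; the function name has_bowtie2_index is hypothetical, and the check covers only the .bt2 suffixes shown in this listing.

```shell
#!/bin/sh
# Hypothetical helper (not part of uap): succeed only if all six bowtie2
# index files shown in the listing above exist for the given base name.
# $1 = directory holding the index, $2 = index base name
has_bowtie2_index() {
    for suffix in 1.bt2 2.bt2 3.bt2 4.bt2 rev.1.bt2 rev.2.bt2; do
        [ -f "$1/$2.$suffix" ] || return 1
    done
    return 0
}
```

With the names from the listing, the directory argument would be bowtie2_index/Mycoplasma_genitalium_index-download-cMQPtBxs and the base name Mycoplasma_genitalium_index-download; the directory name on your machine may differ.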

Congratulations, you’ve finished your first uap workflow!

Go on and try to run some more workflows. Most examples require the human genome, so you might want to continue with the index_homo_sapiens_hg19_genome.yaml workflow from here:

$ uap index_homo_sapiens_hg19_genome.yaml status
[uap] Set log level to ERROR
[uap][ERROR]: Output directory (genomes/animalia/chordata/mammalia/primates/homo_sapiens/hg19/chromosome_sizes) does not exist. Please create it.
$ mkdir -p genomes/animalia/chordata/mammalia/primates/homo_sapiens/hg19/chromosome_sizes
$ uap index_homo_sapiens_hg19_genome.yaml run-locally
<Analysis starts>

Again, you need to create the output folder (you get the idea). Be aware that by default only the smallest chromosome, chromosome 21, is downloaded and indexed. This reduces the required memory and computation time. If you uncomment the download steps for the other chromosomes, the index for the complete genome will be created.
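Creating the destination folder by hand for every example can also be scripted. The following is a minimal sketch and not part of uap: the function name make_destination_path is hypothetical, and it assumes the configuration contains a single top-level line of the form "destination_path: <dir>" with an unquoted value. It covers only the destination_path setting; other directories named in error messages, such as the chromosome_sizes folder above, still have to be created separately.

```shell
#!/bin/sh
# Hypothetical helper (not part of uap): read the destination_path value
# from a configuration file and create that directory if it is missing.
# Assumes one top-level line "destination_path: <dir>" with an unquoted
# value; a real YAML parser would be more robust.
make_destination_path() {
    dest=$(sed -n 's/^destination_path:[[:space:]]*//p' "$1" | head -n 1)
    if [ -z "$dest" ]; then
        echo "no destination_path found in $1" >&2
        return 1
    fi
    # A relative path is created relative to the current directory.
    mkdir -p "$dest"
}
```

Run from within the example-configurations folder, make_destination_path followed by the name of a configuration file would create the folder before you call uap.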

Sequencing Data Analysis

Now that you possess the genome sequences, indices, and annotations, let’s have a look at some example analyses.

General Steps

The analysis of high-throughput sequencing (HTS) data usually starts with some basic steps:

  1. Conversion of the raw sequencing data to, most likely, fastq(.gz) files
  2. Removal of adapter sequences from the sequencing reads
  3. Alignment of the sequencing reads onto the reference genome

These basic steps can be followed up with a lot of different analysis steps. The following analysis examples illustrate how to perform the basic as well as some more specific steps.