OLego -- short or long RNA-seq read mapping to discover exon junction

Jie Wu (wuj@cshl.edu), Chaolin Zhang (czhang@rockefeller.edu)

Updated on Nov 19th, 2012

We are updating the manual frequently, so please check regularly for updates! 

What is OLego?
======================

OLego is a program specifically designed for de novo spliced mapping of mRNA-seq reads. OLego adopts a seed-and-extend scheme, and does not rely on a separate external mapper. It achieves high sensitivity of junction detection by strategic searches with very small seeds (12-14 nt), efficiently mapped using Burrows-Wheeler transform (BWT) and FM-index. This also makes it particularly sensitive for discovering small exons. It is implemented in C++ with full support of multiple threading, to allow for fast processing of large-scale data.

OLego is an open source code project and released under GPLv3. The implementation of OLego relies heavily on BWA (version 0.5.9rc1, http://bio-bwa.sourceforge.net/).  It also uses some source code from the Jim Kent source code tree (http://genome.ucsc.edu/admin/git.html). 

Versions
======================

v1.0.8 ( 11-20-2012 )
---------------------
* Improvement in hit clustering.
* Fixed an overcounting problem in mismatch counting.
* Fixed bug in merging step. 
* Fixed bug in XS tag for extra exon body reads.
* Allows pipe input/output with "-" for some of the scripts. 

v1.0.6 ( 08-09-2012 )
---------------------

* Added option --max-multi (default:1000) to avoid huge data in a single line.
* Added option --num-reads-batch.
* Fixed a bug in the junction connecting step.

v1.0.5 ( 07-16-2012 )
---------------------

* Minor bug fixed (the old code crashes in a very rare case).

v1.0.4 ( 06-12-2012 )
---------------------

* Option changes ( do single-anchor search by default now ). 

v1.0.3 ( 06-10-2012 )
---------------------

* Now supports strand specific library
* Fixed bugs about XS

v1.0.0 ( 05-15-2012 )
----------------------

* The initial Public release


Prerequisites
======================

The major programs of OLego ( olego and olegoindex ) can be installed and ran on Unix-based system (Linux or MacOS) with GCC compiler installed. We provided scripts for post analysis and regression model construction, these codes may require Perl and R. 

Download
======================

The code and binary files can be found at http://sourceforge.net/projects/ngs-olego/files/ , we are regularly updating the code, so please check regularly to keep your code updated. The code can also be reached via git::

	git clone git://git.code.sf.net/p/ngs-olego/code ngs-olego-code

The main programs of OLego (olego and olegoindex ) can be installed and run on Unix-based system with GCC compiler installed. We also provide scripts for post analysis and regression model construction. These codes may require Perl and R installed. 

Installation
======================

To compile OLego on your computer, please go to the OLego directory and type::

	make

If everything goes right, you will find two executable files olegoindex and olego in the folder.

We also provide binary executable files at http://sourceforge.net/projects/ngs-olego/files/ for x86_64 and i686 Linux systems. 

Please feel free to report any problems you come up with.


Usage
======================

Build the index for the genome sequence
----------------------------------------

To run OLego, you need a BWT index for the reference sequences. We use exactly the same genome index used by BWA.  For your convience, you can build the index with olegoindex that comes with this package::

	olegoindex [-a bwtsw|div|is] [-c] <in.fasta>

Arguments:

        ==========  ========================================================================
	Argument    Description
	==========  ========================================================================
	<in.fasta>  This is the fasta format file with the reference sequence. Please put 
		    all the sequences (from different chromosomes ) in a single file.
	==========  ========================================================================
	
Options:

    ========    ==================================================================	
    Option	Description
    ========    ==================================================================
    -a		BWT construction algorithm: bwtsw or is [default: bwtsw]
    -p		prefix of the index [default: the same as the fasta file name]
    ========    ==================================================================
	
Caution: please use "-a bwtsw" for long genome (like human or mouse genome).

There will be 8 files (prefix.pac, prefix.ann, prefix.amb, prefix.rpac, prefix.bwt, prefix.rbwt, prefix.sa, prefix.rsa ) generated after olegoindex finishes.

Running OLego
----------------------

Now you can map your mRNA-seq reads to the genome with olego::

	olego [options] <prefix> <in.fastx>

The arguments and options are decribed as below:

Arguments:
^^^^^^^^^^^^^^^^^^^^^^

    =========== =============================================================================================
    Argument	Description
    ===========	=============================================================================================	
    <prefix>	The prefix of the genome sequence index, including the path and the base name. 
    <in.fastx>	Either fasta or fastq file would work as input. Note that gzipped file is also accepted. 
		Addtionally, using "-" will make the program to read input from STDIN. 
    ===========	=============================================================================================

Basic options:
^^^^^^^^^^^^^^^^^^^^^^

    =====================   ===========================================================================
    Options		    Descriptions
    =====================   ===========================================================================
    -o,--output-file	    Name of the output file [ default: stdout ]. This file will be in SAM
			    format, with some customized tags. Please see the details of the file
			    format below. 
    -j,--junction-file	    Annotation file for known exon junctions. It is in BED format and please 
			    see the junc format description below. 
    -n,--non-denovo	    No de novo junction search. Note that if junction annotation file is 
			    provided by -j, these "known" junctions will still be searched. 
    -t,--num-threads	    Number of threads (INT) [ default: 1 ]. OLego fully supports multiple
			    threading, if you have multiple CPU cores on your computer, please 
			    specify the number of CPUs you want to use with this option. 
    -r,--regression-model   The file with the parameters for the logistic regression model. The mouse 
			    model will be used if no file is selected. The model file contains the 
			    parameters for the regression model (the coefficients, the PWM and the 
			    background ). We have provided model files for mouse and human. User-defined
			    model can also be generated with the regression_model_gen scripts for any 
			    species. Please see its usage below. 
    -M,--max-total-diff	    Maximum total difference between query read and reference sequence. 
			    Either INT or FLOAT number can be used for this option. An INT number 
			    will specify the maximum total edit distance allowed for each alignment.
			    A FLOAT number will specify the fraction of missing alignments given 2% 
			    uniform base error rate. This parameter is the same as -n in BWA. [default: 
			    a FLOAT number 0.04 ]
    -w,--word-size	    The size of the seed used in junction search (INT) [ default: 14 ]. 
			    The default seed size is recommended for reads >100 nt. For shorter
			    reads, a smaller number can be used. e.g., 12 nt for 36 nt reads. The 
			    seeds will be evenly distributed on the read from the start to the end,
			    so please try to cover the read as much as possible with a reasonable 
			    seed size. (13 nt for 36 nt reads is a BAD example. )
    -W,--max-word-occ	    Maximum number of matches of a seed (INT) [ default: 1000]. If a seed has
			    more than this number of hits on the genome, then it will be considerred 
			    repeptive and all of its hits will be discarded. 
    -m,--max-word-diff	    Maximum edit distance allowed for each seed (INT) [ default: 0 ]. 
			    Since our seed size is smaller than other programs, we recommend that 
			    the user use a small number for this option.  
    -I,--max-intron	    Maximum intron size for de novo junction search (INT) [ default: 500000 ].
    -i,--min-intron	    Minimum intron size for de novo junction search (INT) [ default: 20 ].
    -e,--min-exon	    Minimum micro-exon size to be searched (INT) [ default: 9 ].
    -a,--min-anchor	    Minimum anchor size in de novo single-anchor junction searches (INT) 
			    [ default: 8 ]. We define "anchor size" as the smaller number of matched 
			    nucleotides on the read at the end of the junction. 
    -k,--known-min-anchor   Minimum anchor size in single-anchor junction searches when the junction
			    is in the annotation file specified by -j (INT) [ default: 5 ].
    -v,--verbose	    Verbose mode [ default: false ].
    =====================   ===========================================================================

Advanced options:
^^^^^^^^^^^^^^^^^^^^^^^^

    ======================  =================================================================================
    Option		    Description
    ======================  =================================================================================
    --non-single-anchor	    Disable single-anchor de-novo junction search. [ default: enabled ].
    --strand-mode	    Strand mode (INT). This value should be selected from 1, 2 or 3. For strand 
			    specific RNA-seq data, please use 1 if the reads should be mapped to the FORWARD 
			    strand of the RNA, use 2 if the reads should be mapped to the REVERSE strand. If 
			    the library is not strand specific, please use 3 to allow mapping onto both 
			    strands. [ default: 3 ]
    --max-multi		    Maximum number of alignments reported for multiple mappers. [ default: 1000 ]
    --min-logistic-prob	    Minimum logistic probablity for an alignment, calculated with the splice sites
			    motif and intron size, in the range of [0,1) [ default: 0.50]. A higher number
			    means more stringent filter, we don't recommend using high value since more true
			    de novo junctions will be filtered out.  
    --max-overhang	    Maximum number of overhanging nucleotides allowed near the candidate exon 
			    boundary in junction searches (INT) [ default: 6 ]. After we extend the candidate 
			    exons, we search for splice sites in the overhanging regions around the candidate 
			    exon boundary. 
    --max-gapo		    Maximum number of gap opens (INT) [ default: 1 ].
    --max-gape		    Maximum number of gap extensions, -1 for disabling long gaps (INT)[ default: -1 ]. 
    --indel-end-skip	    In BWT querying, do not put an indel within this number towards the ends 
			    [ default: 5 ].
    --gape-max-occ	    Maximum occurrences for extending a long deletion in BWT querying [ default: 10 ]. 
    --penalty-mismatch	    Mismatch penalty for querying involving BWT [ default: 3 ].
    --penalty-gapo	    Gap open penalty for querying involving BWT [ default: 11 ]
    --penalty-gape	    Gap extension penalty for querying involving BWT [ default: 4 ]
    --log-gap		    log-scaled gap penalty for long deletions for querying involving BWT. 
    --num-reads-batch	    This number of reads will be loaded into the memory for processing in each batch.
			    [4*16**4 = 262144]
    --none-stop             non-iterative mode: search for all n-difference hits in the BWT query (slooow).
    ======================  =================================================================================

Other useful scripts
---------------------

mergePEsam.pl
^^^^^^^^^^^^^^^^^^^^^

This script can be used to merge SAM format mapping results from paired-end reads. The two ends will be merged according to their distances and orientation. The script requires the two ends come from the same chromosome with proper orientation and the distance between them smaller than the threshold specified by option -d.   

Usage::

	    perl mergePEsam.pl [options] <end1.sam> <end2.sam> <out.sam> 

Arguments:

    =========== ==========================================================================
    Argument	Description
    ===========	==========================================================================
    <end1.sam>	The SAM format output from one end of the reads. 
    <end2.sam>	The SAM format output from the other end of the reads. Please make sure 
		the same lines in end1.sam and end2.sam are corresponded (i.e. from the 
		same read pair ). 
    <out.sam>	The output file. In SAM format.
    ===========	==========================================================================
    
Options:

    ===========================  ==============================================================================
    Option                       Description
    ===========================  ==============================================================================
    -d	                         Maximum distance between the two ends on the reference [ default:5000000 ].
    --ss, --same-strand          Require the read-pair mapped to the same strand. By default, we require the
                                 two ends mapped to different strands, which is the case in strandard Illumina 
	                         RNA-seq data. 
    --ns, --no-strand            Do not use strand information as a filter. 
    --nci, --no-check-input      Do not check if the read names in the input files are matched. By default, the
	                         script will check if the read names from the two ends are similar to make sure
	                         the lines are correctly matched. Please use this option if your read names are
	                         in a uncomparable format.
    -v	                         Verbose.
    ===========================  ==============================================================================

xa2multi.pl
^^^^^^^^^^^^^^^^^^^^^

This script can be used to extract all the alignments after the tag "XA" in each line. The current version is from BWA package, with minor modification.  

Usage::
	
	perl xa2multi.pl in.sam >out.sam


sam2bed.pl
^^^^^^^^^^^^^^^^^^^^^

This script converts SAM format output from OLego into BED format file. Only the best alignment (major alignment) of each read will be used. 

Usage::
	
	perl sam2bed.pl [options] <in.sam> <out1.bed> [out2.bed]

Arguments:

    ======================  =============================================================================
    Argument		    Description
    ======================  =============================================================================
    <in.sam>		    The SAM input file from OLego. "-" can be used to input from STDIN.
    <out1.bed> [out2.bed]   Please specify two BED files if you want Paired-end reads output into
			    separate BED files, otherwise, all the reads will be output into out1.bed.
			    "-" can be used to output into STDOUT. 
    ======================  =============================================================================

Options:

    ====================  =============================================================================
    Option		    Description
    ====================  =============================================================================
    -u,--uniq		    Only output uniquely mapped reads. The script identifies unique reads
			    by the tag "XT:A:U".
    -r,--use-RNA-strand	    Use the strand of the RNA based on the XS tag. By default, this script uses
			    the strand of the read. 
    -v			    Verbose.
    ====================  =============================================================================    

Using this script to convert SAM outputs from other programs might cause problems! 


bed2junc.pl
^^^^^^^^^^^^^^^^^^^^^

This script can be used to retrieve unique junctions from BED format file. The number of supporting reads of each junction will be in the score (5th) column. The output file can be used as junction annotation file for OLego (option -j).

Usage::
	
	    perl bed2junc.pl <in.bed> <out>

Arguments:
    
    ==========  ====================================================================
    Argument	Description
    ==========	====================================================================
    <in.bed>	The input BED file with the mapping results. "-" can be used to input 
		from STDIN. 
    <out>	The output BED format file with the junctions. This file can be directly 
		used as junction annotation file for olego. "-" can be used to output
		into STDOUT. 
    ==========	====================================================================

regression_model_gen
^^^^^^^^^^^^^^^^^^^^^

This set of scripts can be used to generate the user-defined logistic regression model. 

Usage::

	    perl regression_model_gen/OLego_regression.pl [options]

Options:

	========    ==============================================================
	Option	    Description
	========    ==============================================================
        -g	    The location of the Fasta files downloaded from UCSC genome 
		    browser, the names of the Fasta files should be something 
		    like chr1.fa etc.
        -a	    BED format annotation files for the true transcripts. True 
		    junctions will be extracted from this file. 
        -o	    Output prefix [default: userdefined].
	========    ==============================================================

The model file will be generated in output_prefix.cache, the file name would be output_prefix.cfg. 
				

Additional notes
=====================

File formats
---------------------

SAM format
^^^^^^^^^^^^^^^^^^^^^

OLego outputs the alignments in SAM format (http://samtools.sourceforge.net). Its specification can be found on samtools' website. 

The following tags are used in OLego. Please pay attention to the X? tags, most of them were adopted from BWA:

    ===	===================================================
    Tag	Meaning
    === ===================================================
    NM	Edit distance
    MD	Mismatching positions/bases
    X0	Number of best hits
    X1	Number of suboptimal hits
    XM	Number of mismatches in the alignment
    XN	Number of 'N's in the reference
    XO	Number of gap opens
    XG	Number of gap extentions
    XT	Type: Unique/Repeat/N *
    XS	Strand of the RNA **
    XA	Alternative hits; format: (chr,pos,CIGAR,NM,XS;)
    === ===================================================
    
*   "Unique" or "Repeat" is determined by the number of best hits ( top hits with the same edit distance, X0 ), NOT the total number of hits. "N" means there are more than 10 'N's in the reference ( XN>10 ). 

**  For a junction read, the strand of the RNA is determined by the annotation if the junction is annotated, or by the splice signal if it's a novel junction. For exonic read, the strand can not be determined (a "." is assigned ). 

Addtional scripts have been provided in the package for processing OLego output: sam2bed.pl can be used for conversion from SAM to BED format; xa2multi.pl can extract alignments after XA tags; mergePEsam.pl can merge the two outputs from paired-end RNA-seq data. 

For general processing of SAM files, please check SAMTools.

junc format
^^^^^^^^^^^^^^^^^^^^^

OLego takes junction annotations in junc (BED) format.
    
    =============   ======  ===================================================
    Column	    Name    Description
    =============   ======  ===================================================
    1		    chrom   The name of the chromosome
    2		    start   The starting position of the junction (intron)
    3		    end	    The ending position of the junction (intron)
    4		    name    Name of the junction
    5		    score   This column is reserved as the score of the junction, 
			    the bed2junc.pl provided in the package will output 
			    evidence number in this column. 
    6		    strand  Strand of the junction
    =============   ======  ===================================================

The score column is not essential.

Examples
=====================

An example pipeline for strandard Illumina RNA-seq data::

    olegoindex -a bwtsw mm9.fa (all chromosomes combined)
    # build your BWT index

    olego -v -t 16 -o f.sam ~/mz-local/database/mm9/genome/olego/mm9.fa f.fa
    # do the mapping with 16 CPU cores, output to f.sam

    olego -v -t 16 -o r.sam ~/mz-local/database/mm9/genome/olego/mm9.fa r.fa
    # do the same thing for the other file

    mergePEsam.pl f.sam r.sam merge.sam
    # merge both ends into merge.sam

    sam2bed.pl --use-RNA-strand merge.sam merge.bed
    # convert the SAM file to BED file

    bed2junc.pl merge.bed merge.junc
    # find the junctions in the BED file

    olego -v -t 16 -o f.remap.sam -j merge.junc --non-denovo ~/mz-local/database/mm9/genome/olego/mm9.fa f.fa
    olego -v -t 16 -o r.remap.sam -j merge.junc --non-denovo ~/mz-local/database/mm9/genome/olego/mm9.fa r.fa
    # do a remapping to rescue more reads, no more de novo mapping here since we already used a junction annotation. 

Example commands for strand specific RNA-seq data::

    olego -v -t 16 -o f.sam --strand-mode 1 ~/mz-local/database/mm9/genome/olego/mm9.fa f.fa
    # when you know that the reads should be mapped onto the forward strand of the transcripts

    olego -v -t 16 -o r.sam --strand-mode 2 ~/mz-local/database/mm9/genome/olego/mm9.fa r.fa
    # the other end should be on the reverse strand according to the protocol
    
Some example commands using pipe to save space::

    samtools view -uSh merge.sam |samtools sort - merge.sort
    # this converts the sam output into sorted bam files to save space, and the bam file should be ready for downstream analysis (e.g. Cufflinks).
    
    samtools view merge.sort.bam | sam2bed.pl -r - - | bed2junc.pl - merge.junc
    # this will extract all junctions from the bam output without generating sam and bed files.

