rNA - randomized Numerical Aligner
mrNA - MPI randomized Numerical Aligner

http://iga-rNA.sourceforge.net

****** Installation ******

Briefly, the shell commands `./configure; make; make install' should configure, build, and install this package (`make install' must be executed by root unless you changed the `--prefix' option of `./configure').

Boost libraries (version > 1.40) are required. If MPI support is detected, the MPI version, mrNA, is built automatically (it is experimental code!). If you do not want the MPI version, add `--disable-mpi' to the `./configure' options.

rNA is compiled without debugging information and with the highest level of optimization by default. If you want to debug the code (e.g., because the maintainer asks you to), add `--enable-maintainer-mode' to create `_debug' versions of the programs.

****** Reference preparation ******

To align a set of reads against a reference genome, the rNA index must be built first. To build the hash table for the multi-FASTA file ref.fasta, run the following command:

rNA --create --reference ref.rNA --fasta ref.fasta --bl BL --k K

This command produces the file ref.rNA, which is used later to align reads. The parameters --bl and --k are fundamental and must be chosen carefully.

The value K of the --k parameter determines the hash table size and the fingerprint computation. The hash table has size q=4^{K}-1, and the same value q is used as the modulus when computing the fingerprint. For good performance, q should be larger than the reference size (at least twice as large). K is limited to 15.
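
As a rough illustration of the sizing rule above, the following Python sketch picks the smallest K whose table size q=4^{K}-1 is at least twice the reference length. The function names (hash_table_size, smallest_k) and the treatment of "at least twice" as a hard threshold are assumptions made for this example; they are not part of rNA.

```python
MAX_K = 15  # upper limit on K stated in this README


def hash_table_size(k):
    """Return q = 4^K - 1, the hash table size for a given K."""
    return 4 ** k - 1


def smallest_k(reference_length, factor=2):
    """Return the smallest K <= MAX_K with q >= factor * reference_length.

    Returns None when even K = MAX_K gives a table that is too small.
    The default factor of 2 is an assumption based on the "at least
    twice as large" advice.
    """
    for k in range(1, MAX_K + 1):
        if hash_table_size(k) >= factor * reference_length:
            return k
    return None


if __name__ == "__main__":
    # A reference of 500 million bases needs q >= 1,000,000,000.
    print(smallest_k(500_000_000))
```

For a reference of 500 million bases this sketch returns 15, since 4^{14}-1 is about 268 million and 4^{15}-1 is about 1.07 billion.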

The value of the --bl parameter is also important for rNA's performance. It is the word length used to index the text and to search the reads. rNA computes the hash value of each BL-word present in the reference. During the alignment of a read of length 100, the read is divided into 100/BL non-overlapping segments, and each segment is hashed and searched. The block is searched as presented in the paper: the fingerprint of the block is computed, and all the positions of the reference represented by a fingerprint that could be at Hamming distance at most two are checked.

A large value of BL causes few false positives, but each seed must be searched with a higher number of errors (e.g., with read length 100, BL=50, and 3 errors allowed in the read, the two seeds must be searched with at least 1 error). On the other hand, a small BL causes a larger number of false positives but allows each seed to be searched with 0 errors (e.g., with read length 100, BL=25, and 3 errors allowed in the read, at least one seed must occur without errors).
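
The two examples above follow from a pigeonhole argument: if a read with at most E errors is cut into n seeds, at least one seed carries at most floor(E/n) errors. The Python sketch below reproduces that arithmetic. It is only an interpretation of the examples in this README, not rNA's actual seed selection code, and the function name seed_error_budget is hypothetical.

```python
MAX_SEED_ERRORS = 2  # maximum number of errors allowed in a seed (from this README)


def seed_error_budget(read_length, bl, read_errors):
    """Return (number_of_seeds, errors_per_seed) for one read.

    The read is cut into read_length // bl non-overlapping seeds. If the
    whole read contains at most read_errors errors, at least one seed
    contains at most read_errors // number_of_seeds of them.
    """
    seeds = read_length // bl
    if seeds == 0:
        raise ValueError("BL is longer than the read")
    return seeds, read_errors // seeds


if __name__ == "__main__":
    for bl in (50, 25):
        seeds, per_seed = seed_error_budget(100, bl, 3)
        within_limit = per_seed <= MAX_SEED_ERRORS
        print(bl, seeds, per_seed, within_limit)
```

With read length 100 and 3 allowed errors, BL=50 gives 2 seeds searched with 1 error, and BL=25 gives 4 seeds searched with 0 errors.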

The following table shows, for different numbers of errors and different read lengths, the best number of blocks (and hence the block length) for optimal performance on a reference of length 500MB (grapevine genome).

            pattern length
errors      75      100     150
0           1       1       1
1           2       2       2
2           2       3       3
3           2       3       4
4           3       3       4
5           3       3       4
6           3       4       4
7           3       4       4
8                   3       5
9                   3       5
10                  3       4


The maximum number of errors allowed in the seed is two.



****** Alignment ******

rNA can be used to search for alignments only after having prepared the reference file. A typical use of rNA, for a single read lane, is:

rNA --search --reference ref.rNA --query1 s_1_sequence.txt --auto-errors --output out.sam

which aligns the file s_1_sequence.txt against the reference ref.rNA. In this case, each
read is trimmed before being aligned, and the maximum number of errors is
chosen according to the length of the read after trimming (default value:
1 error every 15bp). The maximum read length allowed is 300bp.
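
To give a feeling for the default error budget, the Python sketch below computes the maximum number of errors for a trimmed read at a rate of one error every 15bp. The function name auto_max_errors is hypothetical, and the use of floor division is an assumption: this README does not say how rNA rounds the result.

```python
MAX_READ_LENGTH = 300  # maximum read length allowed, per this README


def auto_max_errors(trimmed_length, bp_per_error=15):
    """Estimate the error budget as one error per bp_per_error bases.

    Rounding down is an assumption; rNA may round differently.
    """
    if bp_per_error <= 0:
        raise ValueError("bp_per_error must be positive")
    if not 0 <= trimmed_length <= MAX_READ_LENGTH:
        raise ValueError("read length out of range")
    return trimmed_length // bp_per_error


if __name__ == "__main__":
    for length in (36, 75, 100, 150):
        print(length, auto_max_errors(length))
```

Under these assumptions a 100bp read is allowed 6 errors and a 75bp read 5 errors.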

For a paired end read lane, the command could be:

rNA --search --reference ref.rNA --query1 s_1_1_sequence.txt --query2 s_1_2_sequence.txt --output out.sam

If the query files are compressed (in .gz or .bz2 format), the program will decompress the files on the
fly. If you want to produce a BAM file as output, just
add the --bam parameter (remember to write "--output out.bam").

rNA automatically detects whether the inputs are in ILLUMINA or in STANDARD fastq format.
If you want to override the default behaviour, use the --force-illumina or
--force-standard parameter, respectively.

If you use rNA on a multiprocessor machine with N processors, you can define the
number of threads to use with --threads <N>.

If you want to change the default error rate, you can use --errors-rate <N>,
where N is the number of bp for each error (default 15). If you want to fix the
maximum number of errors allowed for each read, you can use --errors E instead of
--auto-errors.

If you do not want to perform the trimming operation, just add the --no-auto-trim
parameter.

In PE alignments the BAM "proper pair" flag is set to true if the two reads are in the same contig and they face each other (---> <---). If you specify the two parameters "--insert-size-min MIN" and "--insert-size-max MAX", the flag is set to true only if the above condition holds AND the insert size is between MIN and MAX.
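
The rule above can be sketched in Python as follows. This is an illustration only, not rNA's code: the Alignment class, the function is_proper_pair, the definition of insert size as the distance from the start of the forward read to the end of the reverse read, and the inclusive MIN/MAX bounds are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    contig: str
    start: int     # leftmost aligned position
    end: int       # one past the rightmost aligned position
    reverse: bool  # True if aligned to the reverse strand


def is_proper_pair(a, b, insert_min=None, insert_max=None):
    """Return True if two mates satisfy the proper pair rule sketched here."""
    if a.contig != b.contig:
        return False
    if a.reverse == b.reverse:
        return False  # same strand: the reads cannot face each other
    fwd, rev = (b, a) if a.reverse else (a, b)
    if fwd.start > rev.start:
        return False  # reads point away from each other (<--- --->)
    if insert_min is not None and insert_max is not None:
        insert_size = rev.end - fwd.start  # assumed definition
        return insert_min <= insert_size <= insert_max
    return True


if __name__ == "__main__":
    r1 = Alignment("chr1", 100, 200, reverse=False)
    r2 = Alignment("chr1", 350, 450, reverse=True)
    print(is_proper_pair(r1, r2))
    print(is_proper_pair(r1, r2, insert_min=400, insert_max=500))
```

In this example the first call reports a proper pair, while the second does not, because the assumed insert size of 350 falls outside the 400 to 500 range.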

If you provide a (small, preprocessed) reference sequence with the
--contamination-reference parameter, each read is first aligned against this reference
before being aligned against the main reference. If a hit in the contamination sequence is found, the read
is marked as not found and the tag XC is set to record that the read belongs to the contamination sequence and not to the reference.

If you want to allow small indels, add the --indels option. You can change the maximum number of bp allowed with the --indels-max <N> parameter (default: 5). The more bp are allowed, the slower the process becomes.
 
With the option --print-all (currently available only for single reads), all the alignments are printed in the SAM file. If you want to limit the number of records, use the --print-first option.


****** Read Filtering ******
One often needs to obtain a set of filtered and trimmed reads. This is a mandatory step when one wants to perform de novo assembly.
rNA's --filter option is designed to solve this problem. A typical use of rNA with this option is:

 rNA --filter --reference contamination.rNA --query1  s_1_1_sequence.txt --query2 s_1_2_sequence.txt --output trimmed_1

rNA will align all the reads against the contamination reference, trim the reads according to their quality, and save the results into
trimmed_1_1.fastq
trimmed_1_2.fastq
trimmed_1_unpaired.fastq
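
For scripts that post-process the filtered reads, the naming pattern above can be captured in a small Python helper. The function name filter_output_files is hypothetical, and the sketch assumes that the three names are always derived from the --output prefix exactly as in the paired-end example; the naming for single-read input is not described here.

```python
def filter_output_files(prefix):
    """Return the file names expected from a paired-end --filter run.

    The pattern is inferred from the example in this README (an assumption).
    """
    return [
        f"{prefix}_1.fastq",         # trimmed first reads
        f"{prefix}_2.fastq",         # trimmed second reads
        f"{prefix}_unpaired.fastq",  # reads left without a mate
    ]


if __name__ == "__main__":
    for name in filter_output_files("trimmed_1"):
        print(name)
```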

If you use rNA on a multiprocessor machine with N processors, you can define the number of threads to use with --threads <N>. The --reference parameter is optional after release 0.9.10.


****** Using on cluster (EXPERIMENTAL CODE) ******
If MPI support is detected, the mrNA program is built automatically (the file mrNA should be present in the src directory).
To use mrNA successfully, the user must know the cluster composition (number of available nodes, number of cores per node).
The examples below show how to use mrNA on an OpenMPI system composed of 10 nodes (n=10) with 8 cores each (p=8).
First of all, a file named `hostfile` containing the names of the 10 nodes to be used must be provided.

mrNA uses a slightly different reference structure than the one used by rNA. In particular, a set of n `.rNA' files with an additional header file
is generated. To do this on our system, run the following command:

mpirun -pernode -hostfile hostfile mrNA --create --reference NAME --fasta /path/of/input.fasta

After the computation ends, we are ready to align the reads:

mpirun -pernode -hostfile hostfile mrNA --search --reference NAME --threads 8 --query1 /path/of/reads.txt [other options]

In some queue environments (like Torque or similar) you do not need to provide the hostfile. If you do not want to use multi-threading, you can remove both the -pernode and --threads options.

mrNA accepts exactly the same arguments as rNA except --query2, --indels, and --gap: the code is still experimental, and at the moment only single read search works, with some limitations.
We are working on PE alignment.


