RECOUNT: Probabilistic Error Correction for Next Generation Sequencing Data

RECOUNT is distributed under the GNU General Public License, either
version 3 of the License, or (at your option) any later version. For
details, see the COPYING file.

*** Contents ***
1. Introduction
2. Requirements
3. Installation
4. Input format
5. Usage
6. Additional data pre-processing software for RECOUNT input
7. Scripts for working with FASTQ 
8. Clustering Reads by Length
9. Credits and Citation
10. Questions, Comments, Problems

*** 1. Introduction ***
RECOUNT is software for estimating the true count of Illumina reads
based on a probabilistic model.
It takes the reads and the quality scores provided by Illumina as input.
Typical applications are transcriptome and metagenomic expression
analysis.


*** 2. Requirements ***
As an illustration, handling around 20 million reads of length 34 with
neighbors at Hamming distance 1 requires at least 10 gigabytes of disk
space. To install the software you need a C++ compiler.


*** 3. Installation *** 
Type 'make' in the src/ directory. This builds two groups of programs:
1) Components of RECOUNT: FindNeighboursWithQual, GenerateProportion and
EstimateTrueCount.
2) Additional preprocessing programs: AverageTagsQuals_LEN (where LEN is
the read length, ranging from 17 up to 104 bp) and PickBaseQual.


*** 4. Input format ***
RECOUNT takes pre-processed data as input. It looks like this:

700218	AAA     40	40	40	
25078	AAC     40	40	3	
25010	AAG     40	40	3	
25315	AAT     40	40	3	
25045	ACA     40	3	40

The first column is the observed (actual) count of a read, the second
column is the read itself, and the remaining columns give the 'average'
quality score of each base in the read, one score per base.

Note that RECOUNT does not consider sequences with 'N's, hence
these sequences have to be removed.
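
As an illustration of the format described above, here is a minimal
Python sketch that parses one input line and skips reads containing
'N'. It is not part of RECOUNT; the names ReadRecord and
parse_recount_line are hypothetical, and the sketch assumes the columns
are separated by any mix of tabs and spaces, as in the sample above.

```python
from typing import List, NamedTuple, Optional


class ReadRecord(NamedTuple):
    count: int          # observed count of the read
    read: str           # the read sequence
    quals: List[float]  # one average quality score per base


def parse_recount_line(line: str) -> Optional[ReadRecord]:
    """Parse one whitespace-separated line; return None for reads with 'N'."""
    fields = line.split()
    count, read = int(fields[0]), fields[1]
    quals = [float(q) for q in fields[2:]]
    if "N" in read.upper():
        return None  # such reads are not considered and should be removed
    if len(quals) != len(read):
        raise ValueError(f"expected {len(read)} quality scores, got {len(quals)}")
    return ReadRecord(count, read, quals)


sample = [
    "700218\tAAA     40\t40\t40\t",
    "25078\tAAC     40\t40\t3\t",
    "12\tANA     40\t2\t40",  # made-up line, dropped because of the 'N'
]
records = [r for r in map(parse_recount_line, sample) if r is not None]
print(len(records))  # 2
```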


*** 5. Usage ***

You can run RECOUNT by executing its Perl wrapper.
The command is:

   perl recount.pl [input] [no_of_neighbor_mismatch]

For example:

   perl recount.pl test-data.txt 1

The maximum allowed number of mismatches is 2. Note that the running
time and space requirements of RECOUNT with 2 mismatches are quadratic
compared to those with 1 mismatch. For the 2-mismatch option, you may
reduce the space and running time requirements by adjusting the error
probability bar, which is set by a variable in recount.pl.
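
To see why the cost grows so quickly, the following Python sketch
enumerates the Hamming-distance neighbors of a read and counts them.
It is only an illustration and not the code used by
FindNeighboursWithQual; the function names are hypothetical, and it
assumes reads contain only the bases A, C, G and T.

```python
from itertools import combinations, product
from math import comb

ALPHABET = "ACGT"


def hamming_neighbors(read: str, max_mismatch: int):
    """Yield every sequence that differs from `read` at 1..max_mismatch positions."""
    for k in range(1, max_mismatch + 1):
        for positions in combinations(range(len(read)), k):
            # At each chosen position, try the three bases that differ
            # from the original one.
            choices = [[b for b in ALPHABET if b != read[p]] for p in positions]
            for substitution in product(*choices):
                neighbor = list(read)
                for p, b in zip(positions, substitution):
                    neighbor[p] = b
                yield "".join(neighbor)


def neighbor_count(length: int, max_mismatch: int) -> int:
    """Closed-form count: sum over k of C(length, k) * 3**k."""
    return sum(comb(length, k) * 3 ** k for k in range(1, max_mismatch + 1))


print(len(list(hamming_neighbors("AAA", 1))))        # 9
print(neighbor_count(34, 1), neighbor_count(34, 2))  # 102 5151
```

For a read of length 34, each read has 102 neighbors with 1 mismatch
but 5151 with 2 mismatches, since the number of position pairs grows
with the square of the read length.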


*** 6. Additional data pre-processing software for RECOUNT input ***
Illumina's pipeline generates two types of data: sequence (seq) and
quality score (prb). This package provides several additional programs
for working with them.

a. AverageTagsQuals_seq_prb takes input in the following format:

AAA  40 -40 -40 -40     40 -40 -40 -40     40 -40 -40 -40
AAA  30 -40 -40 -40     10 -40 -40 -40     20 -40 -40 -40
AAA  20 -40 -40 -40     20 -40 -40 -40     40 -40 -40 -40
AAC  40 -40 -40 -40     40 -40 -40 -40     -40 -40 -40 40

and computes the average of the quality scores over identical reads.
Note that the input must be sorted by read in ascending order.
The output is:

3 AAA  30 -40 -40 -40     23.33 -40 -40 -40     33.33 -40 -40 -40
1 AAC  40 -40 -40 -40     40 -40 -40 -40     -40 -40 -40 40

The command is simply:
    ./AverageTagsQuals [sorted_seq_prb_file] 


b. PickBaseQual takes the output of AverageTagsQuals above and
produces the following result:

3 AAA 30 23.33 33.33
1 AAC 40 40 -40

The command is simply:
    ./PickBaseQual [average_prb_file]
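
The example above is consistent with keeping, for each position, the
score that belongs to the base actually present in the read. The Python
sketch below illustrates that reading of the example. It is not the
PickBaseQual source: the function name pick_base_qual is hypothetical,
and the column order A, C, G, T within each group of four scores is an
assumption.

```python
BASE_ORDER = "ACGT"  # assumed column order within each group of four scores


def pick_base_qual(line: str) -> str:
    """Keep, for each position, the score in the column of the base in the read."""
    fields = line.split()
    count, read = fields[0], fields[1]
    scores = fields[2:]
    if len(scores) != 4 * len(read):
        raise ValueError("expected four scores per base")
    picked = [scores[4 * i + BASE_ORDER.index(base)] for i, base in enumerate(read)]
    return " ".join([count, read] + picked)


print(pick_base_qual("3 AAA  30 -40 -40 -40     23.33 -40 -40 -40     33.33 -40 -40 -40"))
# 3 AAA 30 23.33 33.33
print(pick_base_qual("1 AAC  40 -40 -40 -40     40 -40 -40 -40     -40 -40 -40 40"))
# 1 AAC 40 40 -40
```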


Sometimes the data does not come in PRB format but in FASTQ, where
each base in the tag has only one corresponding quality score.
For that reason we also provide another version of
AverageTagsQuals_seq_prb, which averages the single quality value of
each base in the tag.

c. AverageTagsQuals_LEN takes input in the following format:

AAA  40 40 40 
AAA  30 10 20 
AAA  20 20 40 
AAC  40 40 40

We call this SEQ-QUAL format.

The output is:
3 AAA  30  23.33 33.33 
1 AAC  40  40 40
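
The averaging step can be sketched in Python as follows. This is an
illustration only, not the AverageTagsQuals_LEN source. The function
name average_tag_quals is hypothetical, the sketch assumes that
identical reads are on consecutive lines (sorted input), and the
two-decimal output format is inferred from the example above.

```python
from itertools import groupby


def average_tag_quals(lines):
    """Collapse consecutive identical reads into 'count read avg_1 ... avg_n'."""
    rows = (line.split() for line in lines if line.strip())
    for read, group in groupby(rows, key=lambda fields: fields[0]):
        score_rows = [[float(q) for q in fields[1:]] for fields in group]
        n = len(score_rows)
        # Average each quality column over all occurrences of the read.
        averages = [sum(column) / n for column in zip(*score_rows)]
        # At most two decimals, with trailing zeros trimmed (30.00 -> 30).
        formatted = [f"{a:.2f}".rstrip("0").rstrip(".") for a in averages]
        yield " ".join([str(n), read] + formatted)


seq_qual = ["AAA  40 40 40 ", "AAA  30 10 20 ", "AAA  20 20 40 ", "AAC  40 40 40"]
for line in average_tag_quals(seq_qual):
    print(line)
# 3 AAA 30 23.33 33.33
# 1 AAC 40 40 40
```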


*** 7. Scripts for working with FASTQ ***
Most of the time the dataset comes in FASTQ format.
We also provide additional scripts for preparing such data
as input for RECOUNT.
They are stored in the scripts/ directory of this package.

a) Converting FASTQ to SEQ-QUAL format for use as RECOUNT input

    perl fastq2seqprb.pl [fastq_file] > [SEQ_QUAL_file]


After conversion, the SEQ-QUAL file must be averaged with
AverageTagsQuals_LEN (see point 6c) to obtain the final RECOUNT
input format. The steps are:

    ./AverageTagsQuals_LEN [SEQ_QUAL_file] > [SEQ_QUAL_avg_file]

    perl recount.pl [SEQ_QUAL_avg_file] [no_of_neighbor_mismatch]
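
For readers who want to see what the FASTQ to SEQ-QUAL conversion
involves, here is a rough Python sketch. It is not a port of
fastq2seqprb.pl: the function name fastq_to_seq_qual and the offset
parameter are hypothetical. It assumes four-line FASTQ records and a
quality encoding offset of 33 (older Illumina pipelines used 64), and
it drops reads containing 'N' as a convenience, which the provided
script may or may not do.

```python
import io


def fastq_to_seq_qual(handle, offset=33):
    """Yield 'READ q1 q2 ...' lines from four-line FASTQ records."""
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        read = handle.readline().strip()
        handle.readline()  # '+' separator line
        quality = handle.readline().strip()
        if not header.startswith("@") or len(read) != len(quality):
            raise ValueError("malformed FASTQ record")
        if "N" in read.upper():
            continue  # sketch choice: such reads are not considered by RECOUNT
        # Decode each quality character to its integer score.
        scores = [str(ord(ch) - offset) for ch in quality]
        yield read + "  " + " ".join(scores)


fastq = "@r1\nAAA\n+\nIII\n@r2\nAAC\n+\nI?5\n"
for line in fastq_to_seq_qual(io.StringIO(fastq)):
    print(line)
# AAA  40 40 40
# AAC  40 30 20
```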
    

b) Converting FASTQ to FASTA format. 

  ./fastq2fasta.sh [fastq_file]



*** 8. Clustering Reads By Length ***

Sometimes a dataset contains reads of different lengths, but RECOUNT
can only handle reads of equal length. We provide a script for
separating the reads by length. It is stored in the scripts/
directory of this package.

After the FASTQ file has been converted to SEQ-QUAL format (see
point 7a), the reads can be separated into several files according
to their length with the given script:

  perl cluster_seqprb_by_taglen.pl [seq-prb-file]

This creates several files named [seq-prb-file-LEN].

Next, create the averaged version of each file:

  ./AverageTagsQuals_LEN [seq-prb-file-LEN] > [seq-prb-file-LEN-avg]

Finally, run RECOUNT on the averaged output:

  perl recount.pl [seq-prb-file-LEN-avg] [no_of_neighbor_mismatch]
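
The separation step can be pictured with the following Python sketch.
It is an illustration and not a port of cluster_seqprb_by_taglen.pl.
The function names are hypothetical, and the output naming
'<prefix>-<LEN>' is an assumption based on the file name pattern shown
above.

```python
from collections import defaultdict


def cluster_by_read_length(lines):
    """Group SEQ-QUAL lines by the length of the read in the first column."""
    clusters = defaultdict(list)
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            clusters[len(fields[0])].append(line.rstrip("\n"))
    return dict(clusters)


def write_clusters(clusters, prefix):
    """Write one file per read length, named '<prefix>-<LEN>' (assumed naming)."""
    paths = []
    for length, rows in sorted(clusters.items()):
        path = f"{prefix}-{length}"
        with open(path, "w") as out:
            out.write("\n".join(rows) + "\n")
        paths.append(path)
    return paths


clusters = cluster_by_read_length(["AAA  40 40 40\n", "AC  40 40\n", "AAC  40 40 3\n"])
print(sorted(clusters))  # [2, 3]
```

Each resulting file then holds reads of a single length and can be
averaged and passed to RECOUNT on its own.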



*** 9. Credits and Citation *** 
RECOUNT was developed in C++ by Edward Wijaya at the
Computational Biology Research Center (CBRC), AIST.
The EM algorithm is based on Beissbarth et al.,
Bioinformatics 20, i31-39, 2004.


*** 10. Questions, Comments, Problems ***
Email: e-wijaya@aist.go.jp or p-horton@aist.go.jp
When reporting a problem, please describe exactly how to trigger it.
