This page last changed on Mar 05, 2008 by ac.

Copyright (c) Illumina Inc. 2007. All rights reserved.

Author: Anthony J. Cox

Purpose

This page contains some notes how to do large scale alignments using ELAND.

("the software formerly known as IMPALA")

ELAND stands for E fficient L arge-Scale A lignment of N ucleotide D atabases.

ELAND searches a set of large DNA files for a large number of short DNA reads allowing up to 2 errors per match.

Here
"large" = human genome size and above; of course, it also works for small genomes
"large number" = at least 8 million on a PC with 1Gb of RAM
"short" = 32 bases or less

Compilation

Eland gets compiled automatically as part of the pipeline compilation.

All necessary source files now live in the Eland subfolder of the Pipeline CVS tree, with the exception of GlobalUtilities.h and GlobalUtilities.cpp (both also required by PhageAlign), which live in subfolder Misc.

Each read in your query set of reads must be of the same length and ELAND must be compiled to match. If you wanted a manual compilation, to do this go to Pipeline/Eland and type e.g. make eland_20 for reads of length 20 - eland_20 is the executable that you should run. Alternatively, set OLIGO_LENGTH in your environment to 20 and do make -e eland - in this case your executable will be called just eland.

Preparing the reference genome

Each read in your query set of reads must be of the same length. These reads are aligned to a target set of large DNA sequences. These sequences must be in Fasta format. An example:

>X dna:chromosome chromosome:NCBI36:X:1:154913754:1
CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTCTGAAAGTGGACCTATCAGCAG
GATGTGGGTGGGAGCAGATTAGAGAATAAAAGCAGACTGCCTGAGCCAGCAGTGGCAACC
CAATGGGGTCCCTTTCCATACTGTGGAAGCTTCGTTCTTTCACTCTTTGCAATAAATCTT

They must first be "squashed" into the 2-bits-per-base format that ELAND reads. This is done by

  1. making a folder for the squashed files
    mkdir path/myGenome
    
  2. going to where the files are and doing
    Pipeline/Eland/squashGenome path/myGenome fastAFile1 [fastAFile2 ... ]
    

At the end of this procedure, your directory path/myGenome will look similar to this:

> ls path/myGenome
fastaFile1.fa.2bpb
fastaFile1.fa.vld

Until pipeline version 0.3, one was restricted to a single entry per fasta file for the target sequences. For pipeline version 0.3 this restriction has been removed. If fastaFile1 contained multiple entries then an additional file fastaFile1.fa.idx will be present in path/myGenome.

For speed you may want to make room for the squashed files on your machine's local hard drive rather than keeping them on an external analysis drive. As an example, the squashed files for the 3Gbase human genome will take up 750Mbytes of space (750 Mbytes x 4 bases per byte = 3 Gbases).

Limitations on the size of squashed genome files


This section describes limitations that one can come up against when squashing a large number of small sequences. Prior to version 0.3 it was necessary to place each sequence in a separate file, which meant one can hit the limits described below. However as of pipeline version 0.3, fasta files with multiple entries can be squashed, so the solution is just to place one's sequences in a single file.

The design goal of ELAND was to match a large number of reads against the human genome, ie a small number of large pieces of DNA. The key to doing this efficiently is to fit as large a batch of reads into RAM as possible. Accordingly it stores a match position (chromosome number plus position in chromosome) in 4 bytes, which in turn limits the total length of DNA that ELAND can match against in a single run.

The most significant of the 4 bytes is used as a block number, and can range between 1 and 239 (codes 0 and 240 to 255 are used to denote repeats). The remaining 3 bytes are used to denote position in a block, which can therefore be up to 2^24=16Mb in size. Each file must take up a whole number of blocks.

So in a single ELAND run you can match against:

  • One file of 240x16=3824Mb
  • 239 files, each up to 16Mb in size
  • Or something in between, eg 24 files of up to 160Mb each.
    The NCBI human genome will fit. Please note that at this time (4th April 2007) ELAND does not check that the directory of squashed files it is matching against does not exceed these limits.

Running ELAND as a standalone program

This has changed for 0.4

The eland_extended and eland_pair modes introduced in Pipeline version 0.3 share a common 'export' file format. The intention of this file is to combine all the important information for a lane into one file (or two files in the case of a paired read run). These scripts:

  1. Bring in the base quality value information
  2. Use the base quality values to pick the best alignment (taking into account read-pairing if relevant)
  3. Give a quality score to the alignment
    You do not get these files just by running the ELAND executable, the export files are generated by running several scripts to postprocess the raw ELAND output.

There seems to be an increasingly common requirement to run this set of scripts in a 'standalone' fashion. This can be done via the script Pipeline/Eland/ELAND_standalone.pl. Minimal usage:

./Pipeline/Eland/ELAND_standalone.pl -if read1.fastq -if read2.fastq -it fastq
-eg /lustre/data01/Mondas_software/Genomes/E_coli_ELAND

Full list of options:

option short form Description
--input-file -if must specify at least one file, specify two for paired analysis
--input-type -it type of input file: fastq, fasta or export
--read-length -rl will work it out from input file(s) if not specified
--seed-length -sl length of read used for ELAND alignment - default is to min of read-length and 32
--eland-genome -eg full path of a squashed genome directory
--output-prefix -op will produce a set of output files whose names prefixed by this - defaults to 'reanalysis'
--pipeline-dir -pd path of pipeline installation to call - by default, looks in same dir as executable is found
--base-quality -bq in fasta mode assume all bases have this quality (TBD)
--pair-params -pp paired read analysis parameters to pass to pickBestPair - multiple arguments must go in quotes - defaults to '--circular' - treats all chromosomes as circular

NB: for a paired read analysis, both reads must share same
input-type, read-length and seed-length

TBDs:

  • As of today (5.3.8), -it fasta doesn't work. You'll have to fake a fastq file with a constant base quality value of your choice.
  • I want to add in a partial reanalysis capability to do things like varying the filtering without having to do the whole realignment.
  • It doesn't do quality value recalibration at the moment (5.3.8), because that code is subject to change. It will do 'soon.'
  • For a paired-read run, the ELAND alignment of the two reads could be run in parallel to halve the time to result (assuming you have at least 2 cores free, as even your iPod probably does these days). It doesn't do this yet, it could.

ELAND_standalone is meant for:

  • Quickly testing effect of different filter parameters
  • Quickly testing alignment targets (e.g. align to genome + contaminants to check for contaminants).
  • Testing applications that read export files.
  • Quickly testing pipeline components (for internal use)

If you want to e.g. produce a set of export files for submission then it is probably best to re-run GERALD part of the pipeline to get a 'proper' analysis.

Pitfall:
It uses 'our' fastq encoding scheme (ASCII character = quality value + 64) rather than the more standard ASCII character = quality value + 32

A quick and dirty conversion from 'rest of the world' to 'us' can be done by

 
cat file.fastq | perl -lane '{ if ($l=~/^\+/) { s/(.)/chr(ord($1)+32)/eg;} print; $l=$_ }' > new.fastq

I propose to make ELAND_standalone.pl part of the 0.4 release as several of our informed external users have asked for it and it can't break the existing pipeline code to do so.

Running the ELAND executable

Are you sure you want to do this?

Prior to Pipeline 0.4, this was the only way to run ELAND without re-running the GERALD makefile generator. However, this just gives the raw match output for all sequences. It does not take account of quality-filtering nor of base quality values. You probably want to run the ELAND_standalone.pl script described above.

The command line syntax to run ELAND is

eland_executable queryFile.txt squashedGenomeDir outFile.txt [repeatFile.txt]
  • queryFile.txt is a file of query sequences. This must be either a multi-entry Fasta format file or a one-sequence-per-line ASCII file. The length of each sequence must exceed the read length specified at compilation, else "undefined behaviour" will result. Unspecified bases in the reads must be denoted by an "N." IUPAC ambiguity codes are not handled (yet).
  • squashedGenomeDir is the path to the directory squashed genome files you made above.
  • outFile.txt is the name of the output file. One line of output per sequence, fields are:
    1. Sequence name (derived from file name and line number if format is not Fasta)
    2. Sequence
    3. Type of match:
    NM - no match found.
    QC - no matching done: QC failure (too many Ns basically).
    RM - no matching done: repeat masked (may be seen if repeatFile.txt was specified).
    U0 - Best match found was a unique exact match.
    U1 - Best match found was a unique 1-error match.
    U2 - Best match found was a unique 2-error match.
    R0 - Multiple exact matches found.
    R1 - Multiple 1-error matches found, no exact matches.
    R2 - Multiple 2-error matches found, no exact or 1-error matches.
    4. Number of exact matches found.
    5. Number of 1-error matches found.
    6. Number of 2-error matches found.
    Rest of fields are only seen if a unique best match was found (i.e. the match code in field 3 begins with "U").
    7. Genome file in which match was found.
    8. Position of match (bases in file are numbered starting at 1).
    9. Direction of match (F=forward strand, R=reverse).
    10. How N characters in read were interpreted: ("."=not applicable, "D"=deletion, "I"=insertion).
    Rest of fields are only seen in the case of a unique inexact match (i.e. the match code was U1 or U2).
    11. Position and type of first substitution error (e.g. 12A: base 12 was A, not whatever is was in read).
    12. Position and type of first substitution error, as above.

For historical reasons a slightly different file format is used if an output name ending in .vmf is specified.

  • repeatFile.txt: Optionally you may want to specify a set of words that you know are highly repetitive in your target files at your read length of interest. You can then tell ELAND to ignore them, which greatly speeds up whole-human-genome alignments. There is no automatic way of generating a repeat file, but with a bit of Perl/shell scripting it is straightforward to extract a list of repeats from the output of a couple of ELAND runs to speed up future runs. You must have one sequence per line in alphabetical order. It does not matter if there are multiple identical sequences or if their length exceeds the read length, it does matter if any of them are shorter than the read length. Also note that you can't just eg stick 200 base ALu sequences in there, it will only read the first 32 of each - you need to split them into overlapping k-mers.
  • Also in the Eland directory is a basic test harness ELAND_test.pl which can be run to verify correct operation.

Running ELAND from the pipeline using GERALD

This requires revision v1.12 or later of GERALD.pl in Pipeline/Gerald. For more information on Gerald see Gerald User Guide and FAQ.

The config file you run GERALD on should contain lines such as

ELAND_GENOME /usr/local/share/eland/ncbi35

(or wherever your directory of squashed genome files is held)

ELAND_REPEAT /usr/local/share/eland/reps30_5

(or wherever your list of repeats is held)
and either

ANALYSIS eland

or use

34:ANALYSIS eland

to run ELAND on particular lanes.

Note that ELAND_GENOME referes to a directory, not a file. The usual Gerald variables GENOME_DIR and GENOME_FILE are not used for Eland analysis. As described above, Eland expects a different file format and not Fasta.

You should then be able to run make. Under the hood a script convertToFasta.pl converts and concatenates all the reads into a single large FASTA file which is then passed as an input to ELAND. A script convertFromELAND.pl converts the results back into the by-tile PhageAlign format expected by the rest of the pipeline.

  • For longer runs you will need to bear in mind that at the moment ELAND can cope with read lengths of at most 32.
  • Also note that at the moment you can only specify one ELAND_GENOME per experiment. This is a GERALD not an ELAND limitation but I'm unlikely to get round to fixing it in the near future.
  • CAUTION: ELAND will only detect matches having 2 differences or less from your reads. This means it is less sensitive than PhageAlign, which will always find a best match (although possibly not a unique one) for your reads. This is the price you pay for it being ~10000 times faster than PhageAlign. This means
    • If your data is noisy not all of it is going to align. If this happens to a significant proportion of your data then it is probably the case that your data is too noisy to get good results anyway. "Talk to the hand," as they say.
    • Estimates of error rates based on ELAND output will underestimate the true error rate (since the noisiest reads will not get aligned and hence won't contribute to the calculation).

Eland uses up to 1GB of memory. The pipeline starts one Eland job per lane. In order to prevent most computers from running out of memory, an artificial dependency in the GERALD Makefile prevents multiple instances of Eland from running at the same time. This limitation can be removed by using the option

ELAND_MULTIPLE_INSTANCES 8

in the Gerald config file. Be aware that this may use up to 8 GB of memory on your analysis computer; if not enough memory is available, the analysis is likely to crash. Allowed values for this option are 1, 2, 4, 8. (1 means no lane parallelisation, and uses up to 1 GB RAM, 2 means two parallel jobs and up to 2 GB etc.)

ELAND features, "features" and TBDs

  • Would like to process longer read lengths than 32
  • Quality value aware: at the moment it allows up to two mismatches per read, giving all mismatches equal weight. It would be better to be able to individually weight bases via either a score file or quality values, as is done by PhageAlign. This is on my to-do list: I've thought of a way round it but haven't had time to implement it.
  • More output format options, e.g. print all matches found, not just the unique best match.
  • I also have prototype versions to align paired reads and align bisulphite-treated DNA (to detect methylation sites).

Appendix: How ELAND handles missing bases


I have left the text of this appendix unchanged for the information of those who are interested in details, note that for pipeline version 0.3 the interpretation of 'type I' Ns is switched off by default - as I threaten to do below. If you want to switch it back on, uncomment the '#ifdef ALLOW_N_TO_BE_DELETION' near the beginning of ELAND_outer.h and recompile.

First note that missing bases need to be specified as "N" characters and not "." as in the sequence files. This conversion is handled automatically by GERALD but you need to be aware of it when running ELAND as a standalone program.

ELAND allows up to (read length/4) "external" Ns - i.e at the beginning or end of the read. These are ignored.

ELAND also allows up to 2 "internal" Ns - these are interpreted in one of two ways, which are given equal weight (i.e. if a read with one type 'D' match and one deletion-N match will be called as a repeat).

A 'type D' N is of the form:

Read:   ACNGT
Genome: ACCGT

i.e. the base is there but not detected.

A 'type I' N is of the form:

Read:   ACNGT
Genome: AC-GT

i.e. it has skipped a base.

'Type D' and 'type I' Ns are given equal weight

In lining up bases in your read with bases at the position it aligns to in the reference, you should
a. ignore any leading Ns
b. ignore any 'type I' Ns (because they are 'non existent' bases)

I do have a Perl script that correctly interprets the rules described above, but given that Ns shouldn't occur that often it might be simpler just to ignore N-containing reads.

For ancient historical reasons 'D' stands for 'detection error' and 'I' stands for 'incorporation error' - I realise this is confusing as the temptation is to interpret them as 'deletion' and 'insertion.' I spent considerable time trying to get the N handling to work as described above, but the way it handles Ns is based on very old assumptions about how the (then non existent!) machine would behave - at the time we were intending to sequence single molecules for which I assumed N errors would be a lot more common and would be likely to be of the type described. Most Ns we actually see in practice are due to clusters at the edge of tiles wandering off the edge of the image for a cycle or two due to imperfect re-mapping of the tile position at different cycles - hence giving a 'type D' error. Otherwise the pipeline software strongly prefers to make a base call if at all possible, even if the call is of low quality.

A low priority TBD is to allow the N handler to be switched to 'type D errors only' (which better reflects the way the platform behaves) or just restricted to external Ns.

Document generated by Confluence on Jun 27, 2008 17:25