Ray man page

Ray — assemble genomes in parallel using the message-passing interface

Synopsis

      mpiexec -n NUMBER_OF_RANKS Ray -k KMERLENGTH -p l1_1.fastq l1_2.fastq -p l2_1.fastq l2_2.fastq -o test

      mpiexec -n NUMBER_OF_RANKS Ray Ray.conf # with commands in a file

Description

 The Ray genome assembler is built on top of the RayPlatform, a generic plugin-based
 distributed and parallel compute engine that uses the message-passing interface
 for passing messages.

 Ray targets several applications:

   - de novo genome assembly (with Ray vanilla)
   - de novo meta-genome assembly (with Ray Méta)
   - de novo transcriptome assembly (works, but not tested a lot)
   - quantification of contig abundances
   - quantification of microbiome consortia members (with Ray Communities)
   - quantification of transcript expression
   - taxonomy profiling of samples (with Ray Communities)
   - gene ontology profiling of samples (with Ray Ontologies)

Options

      -help

Displays this help page.

      -version

Displays Ray version and compilation options.

 Using a configuration file

   Ray can be launched with
   mpiexec -n 16 Ray Ray.conf
   The configuration file can include comments (starting with #).

 K-mer length

      -k kmerLength

Selects the length of k-mers. The default value is 21.
             It must be odd because reverse-complement vertices are stored together.
             The maximum length is defined at compilation by MAXKMERLENGTH
             Larger k-mers utilise more memory.

 Inputs

      -p leftSequenceFile rightSequenceFile [averageOuterDistance standardDeviation]

Provides two files containing paired-end reads.
             averageOuterDistance and standardDeviation are automatically computed if not provided.

      -i interleavedSequenceFile [averageOuterDistance standardDeviation]

Provides one file containing interleaved paired-end reads.
             averageOuterDistance and standardDeviation are automatically computed if not provided.

      -s sequenceFile

Provides a file containing single-end reads.

 Outputs

      -o outputDirectory

Specifies the directory for outputted files. Default is RayOutput

 Assembly options (defaults work well)

      -disable-recycling

Disables read recycling during the assembly
             reads will be set free in 3 cases:
             1. the distance did not match for a pair
             2. the read has not met its mate
             3. the library population indicates a wrong placement
             see Constrained traversal of repeats with paired sequences.
             Sébastien Boisvert, Élénie Godzaridis, François Laviolette & Jacques Corbeil.
             First Annual RECOMB Satellite Workshop on Massively Parallel Sequencing, March 26-27 2011, Vancouver, BC, Canada.

      -disable-scaffolder

Disables the scaffolder.

      -minimum-contig-length minimumContigLength

Changes the minimum contig length, default is 100 nucleotides

      -color-space

Runs in color-space
             Needs csfasta files. Activated automatically if csfasta files are provided.

      -use-maximum-seed-coverage maximumSeedCoverageDepth

Ignores any seed with a coverage depth above this threshold.
             The default is 4294967295.

      -use-minimum-seed-coverage minimumSeedCoverageDepth

Sets the minimum seed coverage depth.
             Any path with a coverage depth lower than this will be discarded. The default is 0.

 Distributed storage engine (all these values are for each MPI rank)

      -bloom-filter-bits bits

Sets the number of bits for the Bloom filter
             Default is 268435456 bits, 0 bits disables the Bloom filter.

      -hash-table-buckets buckets

Sets the initial number of buckets. Must be a power of 2 !
             Default value: 268435456

      -hash-table-buckets-per-group buckets

Sets the number of buckets per group for sparse storage
             Default value: 64, Must be between >=1 and <= 64

      -hash-table-load-factor-threshold threshold

Sets the load factor threshold for real-time resizing
             Default value: 0.75, must be >= 0.5 and < 1

      -hash-table-verbosity

Activates verbosity for the distributed storage engine

 Biological abundances

      -search searchDirectory

Provides a directory containing fasta files to be searched in the de Bruijn graph.
             Biological abundances will be written to RayOutput/BiologicalAbundances
             See Documentation/BiologicalAbundances.txt

      -one-color-per-file

Sets one color per file instead of one per sequence.
             By default, each sequence in each file has a different color.
             For files with large numbers of sequences, using one single color per file may be more efficient.

 Taxonomic profiling with colored de Bruijn graphs

      -with-taxonomy Genome-to-Taxon.tsv TreeOfLife-Edges.tsv Taxon-Names.tsv

Provides a taxonomy.
             Computes and writes detailed taxonomic profiles.
             See Documentation/Taxonomy.txt for details.

      -gene-ontology OntologyTerms.txt  Annotations.txt

Provides an ontology and annotations.
             OntologyTerms.txt is fetched from http://geneontology.org
             Annotations.txt is a 2-column file (EMBL_CDS handle & gene ontology identifier)
             See Documentation/GeneOntology.txt
 Other outputs

      -enable-neighbourhoods

Computes contig neighborhoods in the de Bruijn graph
             Output file: RayOutput/NeighbourhoodRelations.txt

      -amos

Writes the AMOS file called RayOutput/AMOS.afg
             An AMOS file contains read positions on contigs.
             Can be opened with software with graphical user interface.

      -write-kmers

Writes k-mer graph to RayOutput/kmers.txt
             The resulting file is not utilised by Ray.
             The resulting file is very large.

      -write-read-markers

Writes read markers to disk.

      -write-seeds

Writes seed DNA sequences to RayOutput/Rank<rank>.RaySeeds.fasta

      -write-extensions

Writes extension DNA sequences to RayOutput/Rank<rank>.RayExtensions.fasta

      -write-contig-paths

Writes contig paths with coverage values
             to RayOutput/Rank<rank>.RayContigPaths.txt

      -write-marker-summary

Writes marker statistics.

 Memory usage

      -show-memory-usage

Shows memory usage. Data is fetched from /proc on GNU/Linux
             Needs __linux__

      -show-memory-allocations

Shows memory allocation events

 Algorithm verbosity

      -show-extension-choice

Shows the choice made (with other choices) during the extension.

      -show-ending-context

Shows the ending context of each extension.
             Shows the children of the vertex where extension was too difficult.

      -show-distance-summary

Shows summary of outer distances used for an extension path.

      -show-consensus

Shows the consensus when a choice is done.

 Checkpointing

      -write-checkpoints checkpointDirectory

Write checkpoint files

      -read-checkpoints checkpointDirectory

Read checkpoint files

      -read-write-checkpoints checkpointDirectory

Read and write checkpoint files

 Message routing for large number of cores

      -route-messages

Enables the Ray message router. Disabled by default.
             Messages will be routed accordingly so that any rank can communicate directly with only a few others.
             Without -route-messages, any rank can communicate directly with any other rank.
             Files generated: Routing/Connections.txt, Routing/Routes.txt and Routing/RelayEvents.txt
             and Routing/Summary.txt

      -connection-type type

Sets the connection type for routes.
             Accepted values are debruijn, hypercube, polytope, group, random, kautz and complete. Default is debruijn.
              debruijn: a full de Bruijn graph a given alphabet and diameter
              hypercube: a hypercube, alphabet is {0,1} and the vertices is a power of 2
              polytope: a convex regular polytope, alphabet is {0,1,...,B-1} and the vertices is a power of B
              group: silly model where one representative per group can communicate with outsiders
              random: Erdős-Rényi model
              kautz: a full de Kautz graph, which is a subgraph of a de Bruijn graph
              complete: a full graph with all the possible connections
             With the type debruijn, the number of ranks must be a power of something.
             Examples: 256 = 16*16, 512=8*8*8, 49=7*7, and so on.
             Otherwise, don't use debruijn routing but use another one
             With the type kautz, the number of ranks n must be n=(k+1)*k^(d-1) for some k and d

      -routing-graph-degree degree

Specifies the outgoing degree for the routing graph.
             See Documentation/Routing.txt

 Hardware testing

      -test-network-only

Tests the network and returns.

      -write-network-test-raw-data

Writes one additional file per rank detailing the network test.

      -exchanges NumberOfExchanges

Sets the number of exchanges

      -disable-network-test

Skips the network test.

 Debugging

      -verify-message-integrity

Checks message data reliability for any non-empty message.
             add '-D CONFIG_SSE_4_2' in the Makefile to use hardware instruction (SSE 4.2)

      -run-profiler

Runs the profiler as the code runs. By default, only show granularity warnings.
             Running the profiler increases running times.

      -with-profiler-details

Shows number of messages sent and received in each methods during in each time slices (epochs). Needs -run-profiler.

      -show-communication-events

Shows all messages sent and received.

      -show-read-placement

Shows read placement in the graph during the extension.

      -debug-bubbles

Debugs bubble code.
             Bubbles can be due to heterozygous sites or sequencing errors or other (unknown) events

      -debug-seeds

Debugs seed code.
             Seeds are paths in the graph that are likely unique.

      -debug-fusions

Debugs fusion code.

      -debug-scaffolder

Debug the scaffolder.

Files

 Input files

    Note: file format is determined with file extension.

    .fasta
    .fasta.gz (needs HAVE_LIBZ=y at compilation)
    .fasta.bz2 (needs HAVE_LIBBZ2=y at compilation)
    .fastq
    .fastq.gz (needs HAVE_LIBZ=y at compilation)
    .fastq.bz2 (needs HAVE_LIBBZ2=y at compilation)
    .sff (paired reads must be extracted manually)
    .csfasta (color-space reads)

 Outputted files

 Scaffolds

    RayOutput/Scaffolds.fasta
    The scaffold sequences in FASTA format
    RayOutput/ScaffoldComponents.txt
    The components of each scaffold
    RayOutput/ScaffoldLengths.txt
    The length of each scaffold
    RayOutput/ScaffoldLinks.txt
    Scaffold links

 Contigs

    RayOutput/Contigs.fasta
    Contiguous sequences in FASTA format
    RayOutput/ContigLengths.txt
    The lengths of contiguous sequences

 Summary

    RayOutput/OutputNumbers.txt
    Overall numbers for the assembly

 de Bruijn graph

    RayOutput/CoverageDistribution.txt
    The distribution of coverage values
    RayOutput/CoverageDistributionAnalysis.txt
    Analysis of the coverage distribution
    RayOutput/degreeDistribution.txt
    Distribution of ingoing and outgoing degrees
    RayOutput/kmers.txt
    k-mer graph, required option: -write-kmers
        The resulting file is not utilised by Ray.
        The resulting file is very large.

 Assembly steps

    RayOutput/SeedLengthDistribution.txt
        Distribution of seed length
    RayOutput/Rank<rank>.OptimalReadMarkers.txt
        Read markers.
    RayOutput/Rank<rank>.RaySeeds.fasta
        Seed DNA sequences, required option: -write-seeds
    RayOutput/Rank<rank>.RayExtensions.fasta
        Extension DNA sequences, required option: -write-extensions
    RayOutput/Rank<rank>.RayContigPaths.txt
        Contig paths with coverage values, required option: -write-contig-paths

 Paired reads

    RayOutput/LibraryStatistics.txt
    Estimation of outer distances for paired reads
    RayOutput/Library<LibraryNumber>.txt
        Frequencies for observed outer distances (insert size + read lengths)

 Partition

    RayOutput/NumberOfSequences.txt
        Number of reads in each file
    RayOutput/SequencePartition.txt
    Sequence partition

 Ray software

    RayOutput/RayVersion.txt
    The version of Ray
    RayOutput/RayCommand.txt
    The exact same command provided

 AMOS

    RayOutput/AMOS.afg
    Assembly representation in AMOS format, required option: -amos

 Communication

    RayOutput/MessagePassingInterface.txt     Number of messages sent
    RayOutput/NetworkTest.txt     Latencies in microseconds
    RayOutput/Rank<rank>NetworkTestData.txt     Network test raw data

Documentation

      - mpiexec -n 1 Ray -help|less (always up-to-date)
      - This help page (always up-to-date)
      - The directory Documentation/
      - Manual (Portable Document Format): InstructionManual.tex (in Documentation)
      - Mailing list archives: http://sourceforge.net/mailarchive/forum.php?forum_name=denovoassembler-users

Author

      Written by Sébastien Boisvert.

Reporting Bugs

      Report bugs to denovoassembler-users@lists.sourceforge.net
      Home page: <http://denovoassembler.sourceforge.net/>

Info

November 2012 Ray 2.1.0 User Commands