This mode converts FASTA or FASTQ data into the Goby “compact-reads” file format. Input files compressed with gzip are handled directly. The mode can also store metadata about the reads in the compact-reads files it writes. It is implemented by edu.cornell.med.icb.goby.modes.FastaToCompactMode.java.
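
As a rough illustration of what reading such input involves, the Python sketch below opens a FASTA file, decompresses it on the fly when the filename ends in .gz, and splits each description line into an identifier (the token before the first space or tab, as described for --include-identifiers) and the remaining text. It is not Goby's Java implementation; the function names open_reads, split_description and read_fasta are hypothetical and exist only for this example.

```python
import gzip
from typing import Iterator, Tuple


def open_reads(path: str):
    """Open a reads file as text, decompressing on the fly if it ends in .gz."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def split_description(line: str) -> Tuple[str, str]:
    """Split a description line into (identifier, remainder).

    The identifier is the token before the first space or tab character.
    """
    text = line[1:] if line.startswith(">") else line
    for i, ch in enumerate(text):
        if ch in " \t":
            return text[:i], text[i + 1:]
    return text, ""


def read_fasta(path: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, description, sequence) for each FASTA record."""
    identifier, description, chunks = None, "", []
    with open_reads(path) as handle:
        for raw in handle:
            line = raw.rstrip("\r\n")
            if line.startswith(">"):
                # A new header closes the previous record, if any.
                if identifier is not None:
                    yield identifier, description, "".join(chunks)
                identifier, description = split_description(line)
                chunks = []
            elif line and identifier is not None:
                chunks.append(line.strip())
    # Emit the last record in the file.
    if identifier is not None:
        yield identifier, description, "".join(chunks)
```

Calling list(read_fasta("reads.fasta.gz")) on a gzip-compressed FASTA file would return one tuple per record, with multi-line sequences joined into a single string.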

Mode Parameters

The following options are available in this mode:

Flag | Arguments | Required | Description
(-d|--include-descriptions) | n/a | no | When this switch is provided, include description lines in the compact output. By default, description lines are ignored. Default value: FALSE
(-y|--include-identifiers) | n/a | no | When this switch is provided, include identifiers in the compact output. By default, identifiers are ignored. Identifiers are parsed out of description lines as the token before the first space or tab character. Default value: FALSE
(-f|--force) | n/a | no | Force overwriting files that already exist. Default value: FALSE
--exclude-sequences | n/a | no | When this switch is provided, sequences are not written to the compact file. This can be useful to keep only an association between sequence index and identifier. Default value: FALSE
--exclude-quality | n/a | no | When this switch is provided, quality scores are not written to the compact file. Default value: FALSE
--verbose-quality-scores | n/a | no | Print quality scores to the console as they are read and converted to Phred scores. Useful for testing with a small number of reads. Default value: FALSE
(-o|--output) | output | no | If there is only one input file, force the output to this filename. If there is more than one input file, each output filename is always the input filename with extensions such as .fasta and .gz removed and .compact-reads appended. You should generally use the .compact-reads extension when writing a compact-reads file.
--quality-encoding | quality-encoding | no | The encoding for quality scores. The default quality encoding is that used by the Illumina pipeline 1.3+. Valid encodings include Illumina, Sanger and Solexa. Default value: Illumina
(-n|--sequence-per-chunk) | sequence-per-chunk | no | The number of sequences written in each compressed chunk. The default is suitable for very many short sequences. Reduce to a few sequences per chunk if each sequence is very large. Default value: 10000
n/a | input | yes | The input FASTA or FASTQ files to convert to compact reads. The output files will have the same filename but end in .compact-reads. If an input file ends in .gz, it is decompressed on the fly.
--parallel | n/a | no | Process input files in parallel. Use when you have many input files to process. You can tune the number of processors used by setting the property pj.nt. For instance, -Dpj.nt=5 will use 5 parallel threads. When --parallel is specified, one thread per processing core of the machine is used unless specified otherwise (with pj.nt). Default value: FALSE
(-t|--num-threads) | num-threads | no | The number of threads to run with. When set to -1, defaults to the number of cores on the machine. This option is only active when --parallel is specified. Default value: -1
--paired-end | n/a | no | Indicates a paired-end run. When this switch is provided, this mode will try to locate a pair input file for each input provided. When a pair input file can be found (see the --pair-indicator argument), the pair sequence and quality scores are loaded into the output compact file. The two paired input files must have exactly the same number of sequences, and the sequences must appear in the same order. Default value: FALSE
--pair-indicator | pair-indicator | no | Pair indicators are used to locate paired sequence input files. The indicator must be two string tokens separated by one comma. The first token should appear in the first input filename of a pair, and the second token should appear in the second filename of a pair. This mode replaces the first token with the second to transform an input filename into a paired input filename. Default value: _1,_2
(-k|--key) | key | no | Provide the key for key/value pairs of metadata. Metadata will be stored in the first entry of the read file.
(-v|--value) | value | no | Provide the value for key/value pairs of metadata. Metadata will be stored in the first entry of the read file. Values are matched to keys according to the order in which key/value pairs appear in the argument list. For instance, -k 1 -v 2 -k a -v b will associate key 1 with value 2 and key a with value b.
--key-value-pairs | key-value-pairs | no | A file with key/value pairs in the Java property format (key=value, one per line). Any key/value pair defined in this file is overridden by a key/value pair defined on the command line when the same key is used.
--codec | codec | no | The name of a codec. When provided, the codec is used to compress reads.
--force-quality-encoding | n/a | no | Force quality encoding values to be within the scale of the chosen encoding. Ignore out-of-bounds errors. Default value: FALSE
--concat | n/a | no | Convert and concatenate a number of input files into one output file. The output file must be named on the command line with the -o option. Default value: FALSE
-x | dynamic-options | no | Set a dynamic option, in the format classname:key=value. Classname is the name of the class that exposes the option (short class name without package), key identifies the option to change, and value is the new value for the option.
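
The filename conventions described for --output and --pair-indicator can be sketched in Python as follows. This is an illustration under stated assumptions, not the mode's actual code. The list of extensions that get stripped is a guess (the description only names .fasta and .gz). Replacing only the first occurrence of the first token is also an assumption, as is returning None when the token is absent. The helper names output_name and paired_name are hypothetical.

```python
from typing import Optional

# Assumed set of extensions; the option description only says ".fasta, .gz, etc."
READ_EXTENSIONS = (".gz", ".fasta", ".fa", ".fastq", ".fq", ".txt")


def output_name(input_name: str) -> str:
    """Strip known read-file extensions, then append .compact-reads."""
    base = input_name
    stripped = True
    while stripped:
        stripped = False
        for ext in READ_EXTENSIONS:
            # Never strip the whole name away.
            if base.endswith(ext) and len(base) > len(ext):
                base = base[: -len(ext)]
                stripped = True
    return base + ".compact-reads"


def paired_name(input_name: str, pair_indicator: str = "_1,_2") -> Optional[str]:
    """Return the candidate paired filename, or None if the first token is absent."""
    tokens = pair_indicator.split(",")
    if len(tokens) != 2 or not tokens[0]:
        raise ValueError("pair indicator must be two tokens separated by one comma")
    first, second = tokens
    if first not in input_name:
        return None
    # Assumption: only the first occurrence of the token is replaced.
    return input_name.replace(first, second, 1)


if __name__ == "__main__":
    print(output_name("sample_1.fastq.gz"))
    print(paired_name("sample_1.fastq.gz"))
```

With the default indicator _1,_2, an input named sample_1.fastq.gz maps to the candidate pair file sample_2.fastq.gz, and its output name under these assumptions is sample_1.compact-reads.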