Friday, January 23, 2015

A compilation of what each BBMap shell script does:

addadapters.sh
Randomly adds adapters to a file, or grades a trimmed file.

bbcountunique.sh
Generates a kmer uniqueness histogram, binned by file position.
There are 3 columns for single reads, 6 columns for paired:
count      number of reads or pairs processed
r1_first   percent unique 1st kmer of read 1
r1_rand    percent unique random kmer of read 1
r2_first   percent unique 1st kmer of read 2
r2_rand    percent unique random kmer of read 2
pair       percent unique concatenated kmer from read 1 and 2
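
As a rough illustration of what the r1_first column measures, the sketch below reports the percentage of reads whose first kmer has not been seen in any earlier read, one row per fixed-size bin of reads. The function name, the default k, the bin size, and the choice to skip reads shorter than k are all assumptions made for this example; it is not how BBTools implements it.

```python
def first_kmer_uniqueness(reads, k=25, interval=4):
    """Yield (count, percent_unique) after every `interval` usable reads.

    A read counts as unique if its first kmer was not seen in any
    earlier read. Reads shorter than k are skipped (an assumption).
    """
    seen = set()
    unique_in_bin = 0
    reads_in_bin = 0
    count = 0
    for seq in reads:
        if len(seq) < k:
            continue
        count += 1
        reads_in_bin += 1
        kmer = seq[:k]
        if kmer not in seen:
            seen.add(kmer)
            unique_in_bin += 1
        if reads_in_bin == interval:
            # Report this bin, then start a fresh one.
            yield count, 100.0 * unique_in_bin / reads_in_bin
            unique_in_bin = reads_in_bin = 0
```

The percentage usually falls as more of the file is read, because later reads are more likely to start with a kmer that has already been seen.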

bbduk.sh
Compares reads to the kmers in a reference dataset, optionally 
allowing an edit distance. Splits the reads into two outputs - those that 
match the reference, and those that don't. Can also trim (remove) the matching 
parts of the reads rather than binning the reads. Can use for quality trimming.
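
The matching step can be pictured with a minimal sketch: collect every kmer of the reference (both strands), then put a read in the "matched" output if it shares at least one kmer with that set. The function names and the default k are made up for this example, exact matching only is assumed (no edit distance), and trimming is not shown; this is not BBDuk's actual code.

```python
def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def kmers(seq, k):
    """Set of all kmers in seq (empty if seq is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def split_by_reference(reads, references, k=5):
    """Split reads into (matched, unmatched) by shared reference kmers."""
    ref_kmers = set()
    for ref in references:
        # Index both strands so reads from either strand can match.
        ref_kmers |= kmers(ref, k) | kmers(revcomp(ref), k)
    matched, unmatched = [], []
    for read in reads:
        if kmers(read, k) & ref_kmers:
            matched.append(read)
        else:
            unmatched.append(read)
    return matched, unmatched
```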

bbduk2.sh
Compares reads to the kmers in a reference dataset, optionally 
allowing an edit distance. Splits the reads into two outputs - those that 
match the reference, and those that don't. Can also trim (remove) the matching 
parts of the reads rather than binning the reads.

bbest.sh
Calculates EST (expressed sequence tags) capture by an assembly from a sam file.
Designed to use BBMap output generated with these flags: k=13 maxindel=100000 customtag ordered

bbfakereads.sh
Generates fake read pairs from ends of contigs or single reads.

bbmap.sh
Fast and accurate short-read aligner for DNA and RNA.

bbmapskimmer.sh
Fast and accurate short-read aligner for DNA and RNA. (Not sure how this differs from bbmap.sh.)

bbmask.sh
Masks sequences of low-complexity, or containing repeat kmers, or covered by mapped reads.

bbmerge.sh
Merges paired reads into single reads by overlap detection.
With sufficient coverage, can also merge nonoverlapping reads using gapped kmers.
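
Overlap detection can be sketched in a few lines: reverse-complement read 2, look for the longest suffix of read 1 that equals a prefix of it, and join the two reads there. The function names and the min_overlap default are assumptions for this example, and it requires an exact match; a real merger would also have to tolerate mismatches and weigh base qualities, which this sketch ignores.

```python
def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def merge_pair(read1, read2, min_overlap=4):
    """Merge a read pair by exact overlap; return None if none is found."""
    r2 = revcomp(read2)
    # Try the longest possible overlap first, then shorter ones.
    for olen in range(min(len(read1), len(r2)), min_overlap - 1, -1):
        if read1[-olen:] == r2[:olen]:
            return read1 + r2[olen:]
    return None
```

For a 14 bp fragment sequenced as two 9 bp reads from opposite ends, the 4 bp overlap is found and the original fragment is returned.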

bbmergegapped.sh
Merges paired reads into single reads by overlap detection.
With sufficient coverage, can also merge nonoverlapping reads using gapped kmers.

bbnorm.sh
Normalizes read depth based on kmer counts.
Can also error-correct, bin reads by kmer depth, and generate a kmer depth histogram.

bbqc.sh
Performs quality-trimming; artifact, human, and phiX removal; adapter-trimming; error-correction and normalization. Designed for Illumina fragment libraries only.

bbrename.sh
Renames reads to <prefix>_<number> where you specify the prefix and the numbers are ordered.

bbsplit.sh
Maps reads to multiple references simultaneously.
Outputs reads to a file for the reference they best match, with multiple options for dealing with ambiguous mappings.

bbsplitpairs.sh
Separates paired reads into files of 'good' pairs and 'good' singletons by removing 'bad' reads that are shorter than a min length.
Designed to handle situations where reads become too short to be useful after trimming.  This program also optionally performs quality trimming.

bbwrap.sh
Wrapper for BBMap to allow multiple input and output files for the same reference.

calcmem.sh
????

calctruequality.sh
Calculates the observed quality scores from a sam file.

callpeaks.sh
No description provided, but it appears to be related to keeping only the reads that fall within a certain region of a histogram.

countbarcodes.sh
Counts the number of reads with each barcode.

countgc.sh
Counts GC content of reads or scaffolds.
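
For reference, GC content is just the share of G and C among the called bases. A minimal sketch, assuming ambiguous bases such as N are left out of the denominator (the function name and that choice are mine, not taken from the tool):

```python
def gc_fraction(seq):
    """Fraction of G/C among the A, C, G, T bases of seq."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0  # no called bases at all
    return (seq.count("G") + seq.count("C")) / acgt
```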

crosscontaminate.sh
Generates synthetic cross-contaminated files from clean files.
Intended for use with synthetic reads generated by RandomReads.

cutprimers.sh
Cuts out sequences corresponding to primers identified in sam files.

decontaminate.sh
Decontaminates multiplexed assemblies via normalization and mapping.

dedupe.sh
Accepts one or more files containing sets of sequences (reads or scaffolds).
Removes duplicate sequences, which may be specified to be exact matches, subsequences, or sequences within some percent identity. Can also find overlapping sequences and group them into clusters.

dedupe2.sh
Accepts one or more files containing sets of sequences (reads or scaffolds).
Removes duplicate sequences, which may be specified to be exact matches, subsequences, or sequences within some percent identity. Can also find overlapping sequences and group them into clusters. (Not sure how this differs from dedupe.sh.)

demuxbyname.sh
Demultiplexes reads based on their name (suffix or prefix) into multiple files.
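
The idea can be sketched as grouping reads by whichever of the given tags their name ends (or starts) with. In this example the records are (name, sequence) tuples, the function name is hypothetical, and reads that match no tag are simply dropped; all three are assumptions, not the tool's documented behavior.

```python
from collections import defaultdict


def demux_by_name(records, names, suffix=True):
    """Group (name, sequence) records by the first matching tag."""
    bins = defaultdict(list)
    for name, seq in records:
        for tag in names:
            hit = name.endswith(tag) if suffix else name.startswith(tag)
            if hit:
                bins[tag].append((name, seq))
                break  # each read goes to at most one bin
    return dict(bins)
```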

ecc.sh
Corrects substitution errors in reads using kmer depth information.
Can also normalize and/or bin reads by kmer depth.

filterbarcodes.sh
Filters barcodes by quality, and generates quality histograms.

filterbycoverage.sh
Filters an assembly by contig coverage.

filterbyname.sh
Filters reads by name.

getreads.sh
Gets reads by number. The first read (or pair) has ID 0, the second read (or pair) has ID 1, etc.
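
The zero-based numbering is easy to get wrong, so here is a small sketch of the same idea on a plain 4-line-per-record FASTQ. The helper names are invented for this example, and it assumes unwrapped, single-end FASTQ input.

```python
from itertools import islice


def read_fastq(lines):
    """Group an iterable of lines into 4-line FASTQ records."""
    it = iter(lines)
    while True:
        record = list(islice(it, 4))
        if len(record) < 4:
            return  # end of input (or a truncated final record)
        yield tuple(line.rstrip("\n") for line in record)


def get_reads(lines, wanted_ids):
    """Return the records whose zero-based position is in wanted_ids."""
    wanted = set(wanted_ids)
    return [rec for i, rec in enumerate(read_fastq(lines)) if i in wanted]
```

Asking for IDs 0 and 2 returns the first and third reads in the file.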

grademerge.sh
Grades correctness of merging synthetic reads with headers generated by RandomReads and re-headered by RenameReads

gradesam.sh
Grades mapping correctness of a sam file of synthetic reads with headers generated by RandomReads3.java

idmatrix.sh
Generates an identity matrix via all-to-all alignment.

khist.sh
Generates a histogram of kmer counts for the input reads or assemblies.
Can also normalize, error-correct, and/or bin reads by kmer depth.

kmercount.sh
--counts kmers?

kmercountexact.sh
Counts the number of unique kmers in a file.

makechimeras.sh
Makes chimeric PacBio reads from nonchimeric reads.

mapPacBio.sh
Fast and accurate short-read aligner for DNA and RNA.

mapPacBio8k.sh
Fast and accurate short-read aligner for DNA and RNA.
To index:   bbmap.sh ref=<reference fasta>
To map:     bbmap.sh in=<reads> out=<output sam>
To map without writing an index:
    bbmap.sh ref=<reference fasta> in=<reads> out=<output sam> nodisk

mapnt.sh
Maps sequences to the nt database.

matrixtocolumns.sh
Turns identity matrices into 2-column format for plotting.

mergeOTUs.sh
Merges coverage stats lines (from pileup) for the same OTU,
according to some custom naming scheme.

mergebarcodes.sh
Concatenates barcodes and quality onto read names.

msa.sh
Aligns a query sequence to reference sequences.
Outputs the best matching position per reference sequence.
If there are multiple queries, only the best-matching query will be used.

phylip2fasta.sh
Transforms interleaved phylip to fasta.

pileup.sh
Calculates per-scaffold coverage information from an unsorted sam file.

printtime.sh
Prints time elapsed since last called on the same file.

randomreads.sh
Generates random synthetic reads from a reference genome. Read names indicate their genomic origin. Allows precise customization of things like insert size and synthetic mutation types, sizes, and rates. Read names generated by this program are used by MakeRocCurve (samtoroc.sh) and GradeSamFile (gradesam.sh). They can also be used by BBMap (bbmap.sh) and BBMerge (bbmerge.sh) to automatically calculate true and false positive rates, if the flag 'parsecustom' is used.

readlength.sh
Generates a length histogram of input reads.

reformat.sh
Reformats reads to change ASCII quality encoding, interleaving, file format, or compression format.

removehuman.sh
Removes all reads that map to the human genome with at least 95% identity after quality trimming.

removesmartbell.sh
Removes Smart Bell adapters from PacBio reads.

repair.sh
Re-pairs reads that became disordered or had some mates eliminated. (That is, it pairs them up again; it does not "repair" them.)

rqcfilter.sh
Performs quality-trimming, artifact removal, linker-trimming, adapter trimming, and spike-in removal using BBDukF. Performs human contaminant removal using BBMap.

samtoroc.sh
Creates a ROC curve from a sam file of synthetic reads with headers generated by RandomReads3.java

seal.sh
Performs high-speed alignment-free sequence quantification,
by counting the number of long kmers that match between a read and
a set of reference sequences.  Designed for RNA-seq with alternative splicing.

shuffle.sh
Reorders reads randomly.

stats.sh
Generates basic assembly statistics such as scaffold count, N50, L50, GC content, gap percent, etc.
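
For readers unfamiliar with N50 and L50, the sketch below computes both under the common convention: N50 is a length, L50 is a count. The function name is made up for this example, and it is worth checking the labels in the tool's own output, since the two are sometimes reported the other way round.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of scaffold lengths.

    Scaffolds are taken longest first until they cover at least half
    of the total length. N50 is the length of the last scaffold taken;
    L50 is how many scaffolds were taken.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise ValueError("lengths must not be empty")
```

With scaffold lengths 2 through 10 (total 54), the three longest scaffolds (10, 9, 8) reach the halfway mark of 27, so N50 is 8 and L50 is 3.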

statswrapper.sh
Runs stats.sh on multiple assemblies to produce one output line per file.

synthmda.sh
Generates synthetic reads following an MDA-amplified single cell's coverage distribution.

testformat.sh
Tests the format of a sequence file based on name and contents.

translate6frames.sh
Translates nucleotide sequences to all 6 amino acid frames.
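
Six-frame translation means three reading frames on the forward strand plus three on the reverse complement. A minimal sketch using the standard genetic code is shown here; the function names are invented for this example, incomplete trailing codons are dropped, and codons containing anything other than A, C, G, T are written as X (both assumptions).

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate seq from its first base; '*' marks a stop codon."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frames(seq):
    """Frames +1, +2, +3, then -1, -2, -3 (on the reverse complement)."""
    seq = seq.upper()
    return [translate(strand[offset:])
            for strand in (seq, revcomp(seq))
            for offset in range(3)]
```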





2 comments:

  1. Hi Laurel,

    I hope you're finding BBMap useful! A couple of comments on these scripts:

    bbmapskimmer.sh: Like BBMap, except BBMap tries to find the single best mapping location for a read, and all other sites with the same or slightly lower score as the best site. BBMapSkimmer tries to find all sites that are above some minimum score threshold. Thus, if a read has a perfect match somewhere, and could also map at another place with, say, 5 mismatches, BBMap would NOT report the second site while Skimmer would (with default settings).

    calcmem.sh is a helper script to calculate the amount of free memory on the computer. This is necessary to give to Java as a parameter. All the other shellscripts call it.

    dedupe2.sh is like dedupe.sh but supports an unlimited number of kmers to hash per input sequence; dedupe.sh only supports 4 (2 prefixes and 2 suffixes). This is only relevant when doing deduplication/clustering with mismatches allowed. dedupe will automatically call dedupe2 if necessary.

    bbqc.sh, rqcfilter.sh, removehuman.sh, and mapnt.sh were not supposed to be released in the public distribution; they have hard-coded file paths and only run on JGI's cluster (Genepool). I'll remove them in the next release :)

    Feel free to contact me if you have any questions; I always like to hear how my software is being used in the real world!

    -Brian

  2. Hi, I have a question about bbduk.sh. I used bbduk for quality trimming and found that some fastq files contain far fewer reads afterwards. However, when I rerun the analysis, the results change: the size of the trimmed fastq differs from one run's output to the next. Is that common? Thank you. xb48@cornell.edu
