Laurel's Lab Notebook: January 2015

Friday, January 23, 2015

Compilation function of each BBmap shell scripts:

addadapters.sh

Randomly adds adapters to a file, or grades a trimmed file.

bbcountunique.sh

Generates a kmer uniqueness histogram, binned by file position.

There are 3 columns for single reads, 6 columns for paired:

count       number of reads or pairs processed

r1_first    percent unique 1st kmer of read 1

r1_rand     percent unique random kmer of read 1

r2_first    percent unique 1st kmer of read 2

r2_rand     percent unique random kmer of read 2

pair        percent unique concatenated kmer from read 1 and 2

bbduk.sh

Compares reads to the kmers in a reference dataset, optionally 

allowing an edit distance. Splits the reads into two outputs - those that 

match the reference, and those that don't. Can also trim (remove) the matching 

parts of the reads rather than binning the reads. Can use for quality trimming.

bbduk2.sh

Compares reads to the kmers in a reference dataset, optionally 

allowing an edit distance. Splits the reads into two outputs - those that 

match the reference, and those that don't. Can also trim (remove) the matching 

parts of the reads rather than binning the reads.

bbest.sh

Calculates EST (expressed sequence tags) capture by an assembly from a sam file.

Designed to use BBMap output generated with these flags: k=13 maxindel=100000 custom tag ordered

bbfakereads.sh

Generates fake read pairs from ends of contigs or single reads.

bbmap.sh

Fast and accurate short-read aligner for DNA and RNA.

bbmapskimmer.sh

Fast and accurate short-read aligner for DNA and RNA. (not sure difference from above)

bbmask.sh

Masks sequences of low-complexity, or containing repeat kmers, or covered by mapped reads.

bbmerge.sh

Merges paired reads into single reads by overlap detection.

With sufficient coverage, can also merge nonoverlapping reads using gapped kmers.

bbmergegapped.sh

Merges paired reads into single reads by overlap detection.

With sufficient coverage, can also merge nonoverlapping reads using gapped kmers.

bbnorm.sh

Normalizes read depth based on kmer counts.

Can also error-correct, bin reads by kmer depth, and generate a kmer depth histogram.

bbqc.sh

Performs quality-trimming; artifact, human, and phiX removal; adapter-trimming; error-correction and normalization. Designed for Illumina fragment libraries only.

bbrename.sh

Renames reads to <prefix>_<number> where you specify the prefix and the numbers are ordered.

bbsplit.sh

Maps reads to multiple references simultaneously.

Outputs reads to a file for the reference they best match, with multiple options for dealing with ambiguous mappings.

bbsplitpairs.sh

Separates paired reads into files of 'good' pairs and 'good' singletons by removing 'bad' reads that are shorter than a min length.

Designed to handle situations where reads become too short to be useful after trimming.  This program also optionally performs quality trimming.

bbwrap.sh

Wrapper for BBMap to allow multiple input and output files for the same reference.

calcmem.sh

????

calctruequality.sh

Calculates the observed quality scores from a sam file.

callpeaks.sh

No description--but appears related to keeping only certain reads within a certain region of histogram.

countbarcodes.sh

Counts the number of reads with each barcode.

countgc.sh

Counts GC content of reads or scaffolds.

crosscontaminate.sh

Generates synthetic cross-contaminated files from clean files.

Intended for use with synthetic reads generated by RandomReads.

cutprimers.sh

Cuts out sequences corresponding to primers identified in sam files.

decontaminate.sh

Decontaminates multiplexed assemblies via normalization and mapping.

dedupe.sh

Accepts one or more files containing sets of sequences (reads or scaffolds).

Removes duplicate sequences, which may be specified to be exact matches, subsequences, or sequences within some percent identity. Can also find overlapping sequences and group them into clusters.

dedupe2.sh

Accepts one or more files containing sets of sequences (reads or scaffolds).

Removes duplicate sequences, which may be specified to be exact matches, subsequences, or sequences within some percent identity.Can also find overlapping sequences and group them into clusters. (Not sure difference between dedupe.sh)

demuxbyname.sh

Demultiplexes reads based on their name (suffix or prefix) into multiple files.

ecc.sh

Corrects substitution errors in reads using kmer depth information.

Can also normalize and/or bin reads by kmer depth.

filterbarcodes.sh

Filters barcodes by quality, and generates quality histograms.

filterbycoverage.sh

Filters an assembly by contig coverage.

filterbyname.sh

Filters reads by name.

getreads.sh

Gets reads by number. The first read (or pair) has ID 0, the second read (or pair) has ID 1, etc.

grademerge.sh

Grades correctness of merging synthetic reads with headers generated by RandomReads and re-headered by RenameReads

gradesam.sh

Grades mapping correctness of a sam file of synthetic reads with headers generated by RandomReads3.java

idmatrix.sh

Generates an identity matrix via all-to-all alignment.

khist.sh

Generates a histogram of kmer counts for the input reads or assemblies.

Can also normalize, error-correct, and/or bin reads by kmer depth.

kmercount.sh

--counts kmers?

kmercountexact.sh

Counts the number of unique kmers in a file.

makechimeras.sh

Makes chimeric PacBio reads from nonchimeric reads.

mapPacBio.sh

Fast and accurate short-read aligner for DNA and RNA.

mapPacBio8k.sh

Description:  Fast and accurate short-read aligner for DNA and RNA.To index:   bbmap.sh ref=<reference fasta>

To map:     bbmap.sh in=<reads> out=<output sam>

To map without writing an index:

    bbmap.sh ref=<reference fasta> in=<reads> out=<output sam> nodisk

mapnt.sh

Maps sequences to the nt database.

matrixtocolumns.sh

Turns identity matrices into 2-column format for plotting.

mergeOTUs.sh

Merges coverage stats lines (from pileup) for the same OTU,

according to some custom naming scheme.

megebarcodes.sh

Concatenates barcodes and quality onto read names.

msa.sh

Aligns a query sequence to reference sequences.

Outputs the best matching position per reference sequence.

If there are multiple queries, only the best-matching query will be used.

phylip2fasta.sh

Calculates per-scaffold coverage information from an unsorted sam file.

pileup.sh

Calculates per-scaffold coverage information from an unsorted sam file.

printtime.sh

Prints time elapsed since last called on the same file.

randomreads.sh

Generates random synthetic reads from a reference genome.  Read names indicate their genomic origin. Allows precise customization of things like insert size and synthetic mutation type, sizes, and rates. Read names generated by this program are used by MakeRocCure (samtoroc.sh) and GradeSamFile (gradesam.sh). They can also be used by BBMap (bbmap.sh) and BBMerge (bbmerge.sh) to automatically calculate true and false positive rates, if the flag 'parsecustom' is used.

readlength.sh

Generates a length histogram of input reads.

reformat.sh

Reformats reads to change ASCII quality encoding, interleaving, file format, or compression format.

removehuman.sh

Removes all reads that map to the human genome with at least 95% identity after quality trimming.

removesmartbell.sh

Remove Smart Bell adapters from PacBio reads

repair.sh

Re-pairs reads that became disordered or had some mates eliminated. (not repair)

rqcfilter.sh

Performs quality-trimming, artifact removal, linker-trimming, adapter trimming, and spike-in removal using BBDukF. Performs human contaminant removal using BBMap.

samtoroc.sh

Creates a ROC curve from a sam file of synthetic reads with headers generated by RandomReads3.java

seal.sh

Performs high-speed alignment-free sequence quantification,

by counting the number of long kmers that match between a read and

a set of reference sequences.  Designed for RNA-seq with alternative splicing.

shuffle.sh

Reorders reads randomly.

stats.sh

Generates basic assembly statistics such as scaffold count, N50, L50, GC content, gap percent, etc.

staswraapper.sh

Runs stats.sh on multiple assemblies to produce one ouput line per file.

synthmda.sh

Generates synthetic reads following an MDA-amplified singe cell's coverage distribution.

testformat.sh

Tests the format of a sequence file based on name and contents.

textfile.sh

Translates nucleotide sequences to all 6 amino acid frames.

Thursday, January 22, 2015

Intention is to extend the ORA pipeline to scan for all chemosensory related genes.

Take 15 random V1Rs from the Young, et al (2010) paper.
-->Should you only take functional genes as training sequences?

At first I was aligning the sequences with ClustalOmega and trying to put the sequences in .msf format, as many HMMER tutorials show. However, I could not for the life of me figure out what was wrong and kept getting the following error:

hmmbuild vno.hmm vno_trainerSeqs.msf

Alignment input open failed.

   couldn't determine alignment input format

   while reading file vno_trainerSeqs.msf

So I gave up on .msf and tried PHYLIP--seemed to have worked fine

head vno_trainerSeqs.msf

 15  384

v1r-10    ----MLKLVIIENMAEIMLFSLDLLLFSTDIL----CFNFPSKMIKLPGF

v1r-06    ---------------------------------------MMNKNSRLYTD

v1r-04    ----------------------------------------------MAVD

v1r-12    -------------------------------------------------M

v1r-01    -------------------------------------MSAHGNSLKTTEE

v1r-14    ---------------------------------------------MLTYD

v1r-15    ---------------------------------------------MSSHK

v1r-03    ---------------------------------------------MDITE

v1r-02    -------------------------------------------MKMTSSN

v1r-08    ---------------------------------------------MSSAK

v1r-11    MVGDTLKLLSPL-MTRY----FFLLFYSTDSSDLNENQHPLDFDEMAFGK

v1r-05    ---------------------------------------------MDARN

v1r-13    ---------------------------------------------MASKD

v1r-07    -------MIHVD-RDSY----PL-----AGFSSSEDKYLSLTTDRKASRE

v1r-09    ---------------------------------------------MATGD

ITIQIFFYPQASFGISANTILLLFHIFTFVFSHRSKSIDMIISHLSLIHI

SNIRNTFFAEIGIGVSANSLLLLFNIFKLICGQRSRLTDLPIGLLSLINL

VAQGVSFLYQTGLGILGNSLLITLYLTSFLLGSKLKPTDLTIIHLALVHT

LSFKKAFYFQAGIGISANIFLLLWHIFTFFKDHKPKNHDLIICHLAFAHI

VALQILLLCQFGVGTVANVFLFVHNFSPVLTGSKQRPRQVILSHMAVANA

DFMCIFHKLQTIIGLFGNSFLLYLYILKLIINQRITLIDKICINLVFSNI

VGLEIVYLTLLLFGILGNMFLIYLQSLKFITDHRKRVINLIIINLALAHT

LSFGIAIVMQFSIGVSVNVFVFLFYAQIISTSYKASFSDLILAHLAFANT

LVVGILLFSQIVMGMLGNSSILFYYVILIFTGKHLTPKDLIIEHLTFANC

WETRIILVAQMGVGILGNTSLFCLCNFTLFTGQKVRCTDIILSQLALANS

VKSGISFLIQTGVGILGNSFLLCFYNLILFTGHKLRPTDLILSQLALANS

LGVGITYLLQSVVGLLGNISLIFYYLIIYYKKHKIKPMDLILMHLIIVNI

FAIGMIL-SQIMVGFLGNFFLLYHYSFLHFTRGMLQSTDLTLKHLTIANS

LVVGIILSLQTTFGILGNFSLLYHYLFLYFTGSRLRSTDLIVKNLIVANL

LVVGMVFLSQTILGILGNFSLLCHYLLLHFTGCRARCTDLILRHLTIANS

Alrighty. Will this now work with ORA?
It appears so! Actually is not too difficult.
[1] Put the alignment file of the training receptors and run "hmmbuild" to make the HMMER probability file.

hmmbuild vno.hmm vno_trainerSeqs.phy

# hmmbuild :: profile HMM construction from multiple sequence alignments

# HMMER 3.1b1 (May 2013); http://hmmer.org/

# Copyright (C) 2013 Howard Hughes Medical Institute.

# Freely distributed under the GNU General Public License (GPLv3).

# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

# input alignment file:             vno_trainerSeqs.phy

# output HMM file:                  vno.hmm

# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

# idx name                  nseq  alen  mlen eff_nseq re/pos description

#---- -------------------- ----- ----- ----- -------- ------ -----------

1     vno_trainerSeqs         15   384   315     2.03  0.590 

[3] In the "scripts" folder of the ORA pipeline, append the vno.hmm text to the or.hmm text. Make sure there is a new line after the last "//" or it will not press properly.

[4] Delete "or.hmm.h3f", "or.hmm.h3i", "or.hmm.h3m", and "or.hmm.h3p". Then repress the new or.hmm:
hmmpress or.hmm
Looks like with V2R, the ORA is picking up things like glutamate receptors (GRM3), and GABA receptors. Wonder why....will need to think about and retrain. I'm assuming the introns are throwing it off.

Troubleshooting fastx-toolkit. Trying to resolve this error. I have uninstalled the newer versions of both libgtextutils and fastx_toolkit and am reverting to hopefully a more stable version.

The "Abort trap: 6" was never really resolved for me but after running this version and compiling from source, the error does go away (can't say it is resolved). However, a new suite of errors with the older version pops up. Curiously, this is accounted for in the newer version. The errors are similar to the installation of phylobayes. I have troubleshooted them here and have a successful installation. Now hopefully I can move on with my life. I am going to try something new.

Anyways, here is the installation error workaround:

cd /usr/local/bin/
wget http://cancan.cshl.edu/labmembers/gordon/files/libgtextutils-0.6.tar.bz2 #note this is not latest version
tar -xjf libgtextutils-0.6.tar.bz2
cd libgtextutils-0.6
./configure
make
sudo make install

wget http://cancan.cshl.edu/labmembers/gordon/files/fastx_toolkit-0.0.12.tar.bz2 #note this is not latest version
tar -xjf fastx_toolkit-0.0.12.tar.bz2  
cd fastx_toolkit-0.0.12
./configure
make
g++ -DHAVE_CONFIG_H -I. -I../..   -I../libfastx   -g -O2 -Wall -Wextra -Wformat-nonliteral -Wformat-security -Wswitch-default -Wswitch-enum -Wunused-parameter -Wfloat-equal -Werror -DDEBUG -g -O1 -DDEBUG -g -O1 -MT fastx_collapser.o -MD -MP -MF .deps/fastx_collapser.Tpo -c -o fastx_collapser.o fastx_collapser.cpp
fastx_collapser.cpp:50:10: fatal error: 'tr1/unordered_map' file not found
#include <tr1/unordered_map>
         ^
1 error generated.
make[3]: *** [fastx_collapser.o] Error 1
make[2]: *** [all-recursive] Error 1
make[1]: *** [all-recursive] Error 1
make: *** [all] Error 2
cd src/
cd fastx_collapser
vi fastx_collapser.cpp
#press i to edit in vi

#original
#include <tr/unordered_map>
#new

#include <unordered_map>
#save and exit, press ESC and then type ":wq!"

cd ../..
make
fastx_collapser.cpp:51:6: error: no member named 'tr1' in namespace 'std'
std::tr1::unordered_map<string,size_t> collapsed_sequences;
~~~~~^
1 error generated.
make[3]: *** [fastx_collapser.o] Error 1
make[2]: *** [all-recursive] Error 1
make[1]: *** [all-recursive] Error 1
make: *** [all] Error 2
cd src/fastx_collapser
vi fastx_collapser.cpp
#original

std::tr1::unordered_map<string,size_t> collapsed_sequences;
#new

std::unordered_map<string,size_t> collapsed_sequences;
sudo make install

Wednesday, January 21, 2015

Downloading new transcriptome data from BGI!

#Enter the ftp address in terminal window

ftp 128.120.88.244

Connected to 128.120.88.244.

220 (vsFTPd 2.2.2)

#input given username and password

#copy from ftp into your current directory

get DR_086_MO_E_1.fq.gz

get DR_086_MO_E_2.fq.gz

#or we can get multiple files

mget DR_086_VN*.fq.gz

#or we can get all the files!

mget *.fq.gz

exit

Now we can unzip the data, concatenate the two files for each species, and run the quality control filter