Friday, October 16, 2015

Installing pbh5tools on OSX yosemite

I need to assemble PacBio amplicons of olfactory receptors. One of the main challenges I have been having is that most of the PacBio tools (SMRT-analysis, Celara assembler) focus on assembling whole genomes. Using an approach from Larson, et al. (2014) in BMC Genomics, I am going to do quality filtering using pbh5tools that will filter the reads at a minimum of passes (e.g. 4; though my mean number of passes is quite high for the data (~20)).

To install pbh5tools, I dealt with a series of challenges on Yosemite:
Download from github.
cd pbh5tools-master/
sudo python install

I ran into this pesky error.
In file included from /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/core/include/numpy/ndarraytypes.h:1760:
/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/core/include/numpy/npy_1_7_deprecated_api.h:15:2: warning: 
      "Using deprecated NumPy API, disable it by "          "#defining
#warning "Using deprecated NumPy API, disable it by " \
In file included from /tmp/easy_install-_W_eCV/h5py-2.5.0/h5py/defs.c:279:
/tmp/easy_install-_W_eCV/h5py-2.5.0/h5py/api_compat.h:27:10: fatal error: 
      'hdf5.h' file not found
#include "hdf5.h"
1 warning and 1 error generated.

error: Setup script exited with error: command 'cc' failed with exit status 1
First I tries to install hdf5 tool packages and then export the path there, but that wasn't working.

From the, I see that it needs:
        'pbcore >= 0.8.0',
        'numpy >= 1.6.0',
        'h5py >= 1.3.0'

sudo pip install pbcore
sudo pip install numpy
sudo pip install h5py

sudo python install
Using /Library/Python/2.7/site-packages
Searching for pysam==0.8.3
Best match: pysam 0.8.3
pysam 0.8.3 is already the active version in easy-install.pth

Using /Library/Python/2.7/site-packages
Finished processing dependencies for pbh5tools==0.8.0

Sunday, October 11, 2015

filing COEUS with DDIG at Stony Brook

Applying for an NSF DDIG for the first time is an experience full of lots of emotions. I tried to put together some advice to prevent a few less tears being shed.

One does not just simply "submit" a DDIG. It is a several week-long process full of joy and laughter, and the only way to prevent a stress-induced stomach ulcer is to sprinkle the experience with sarcastic remarks. Here is a short guide for getting yourself set-up in FASTLANE and submitting to COEUS (Stony Brook's version of the grant-submitting interface that mirrors online forms of the 1990s).

A true and serious piece of advice is to begin this two weeks before the deadline (at least). Believe me, procrastination will do you no favors. Do this before you have a draft together. The minimum you need to get started is the title of your project (which you can always change later). But you need to get started. Also, make sure you have your IACUC and IRB approval taken care of well in advance, because you will need this for COEUS as well.

[1] First, you must apply for an NSF FastLane login name. This is not something you sign up for on the NSF website, you have absolutely no control over this. Someone in your university will grant you this ID. This is different than the GFRP login if you are a GFRP fellow. Forget the beautifully designed FastLane submission form of the GFRP, where the sea-green trim borders all of the boxes and you actually may feel happy about life. Those days are in your past and you must now submit yourself to the harsh reality that the federal government does not have a budget for maintaining web submission forms post the early 2000s, and never will. To request your NSF ID (summarized in better detail here), you must e-mail your Department Grant Officer. For Ecology & Evolution, it is (at the time of writing) Gloria (and she is truly great!).

[2] Log in to FastLane. Go to, click "Proposals, Awards and Status". Using your new login information, login. Click "Proposal Functions", "Proposal Preparation", and "Prepare Proposal". Also, update any of your FastLane information. For a DDIG, your adviser (the PI) must do this first. Sit with them (or do it for them), with your NSF ID, and create the new proposal.

[3] Log in to COEUS and begin a new proposal. Your adviser may have to do this first and then add you. You first must at least fill out the first page of creating a new proposal. Either do this for your adviser (before you do it for you) or sit with your adviser. Under the tab "Investigators/Key Persons", they can now add you as a Co-PI. As an added Co-PI, you need to fill out a series of questions to certify.
sunshine and happiness
An example of my login page
Things to pay attention to-- the start date for the NSF DEB begins at the earliest 05/01 of the next year following submission and can last for 24 months. Note that the end date could not be 05/01/2018, but must be 04/30/2018. The title must begin with "DISSERTATION RESERCH:......".

[4] Add the Grants Management Officer (in this case, Gloria) as an aggregator. Before you yourself can do anything, the PI (your adviser) must give you permission to add and edit the proposal. On the left, click on "Proposal Roles". Add yourself and Gloria as an aggregator. 
You add the permissions for the people under Aggregator, the Grants Officer takes care of the rest
[5] Fill out "Credit Split" tab. I really had no clue what was going on in this tab. Basically, give your adviser all of the credit and make sure the numbers add up to 100. 

[6] Under the "Special Review" tab, upload any IACUC of IRB approvals associated with your project (there are certain numbers to report associated with your project. 

[7] Update the "Investigators/Key Persons" tab to reflect the amount of "Effort" you will be putting into the project. Even though the DDIG has no salary associated with it, this still must be reported Remember that the university cares about nothing except money and it likes to know where it should be distributed over what frame of time, even if its not even there. 

Notice that the C is referring to Calendar Year %Effort, not months. 1 month Calendar Year is 8.33% (1 / 12mo); 1 month AY 11.11% (1 / 9mo); 1 month summer 33.33% (1 / 3mo). My advisers effort is 1 month out of 12 calendar months. Mine is 11 out of 12 calendar months. This is where you can mess up because this effort MUST match what you also put in your CURRENT & PENDING documents. You can see my approval is still in progress, as an effect of how I did manage to mess this up. :-|

[8]Answer all the random questions in the tabs under "Questionnaire". I do recall these being relatively painless. 

[9] Prepare the documents you need to submit to COEUS. Don't do anything in the budget tab. Instead, fill out the "Budget Worksheet", an excel file where you add everything and then upload it under "Upload Documents". There should be three tabs, each with different forms you need to update. These forms change all of the time so make sure from the COEUS forms tab, you are downloading the right ones. Some documents you can generate the template yourself.

An example of a successful DDIG with all of the parts can be found here. Feel free to contact me if you would like more Stony Brook-related examples of some of these documents. 
  • Under the tab "Upload Proposal Documents"
    • COEUS project summary: This can be the DDIG project summary format (Overview, Intellectual Merit, Broader Impacts) or if this is not ready yet, you can submit a random paragraph from your project that summarizes the broad objectives. 
    • DDIG Facilities statement: Basically, you summarize anything equipment, computing facilities, or laboratories you will be in your project. Make a statement about the Stony Brook Libraries too. See the example above for a model of (in my opinion, bare minimum) facilities page. I will be happy to provide an example of thorough one.
    • DDIG Budget Justification: A one-two page summary that includes a timeline of how the money will be used in the proposal. Don't go super crazy (e.g. not necessary to include the exact quote for a Qiagen extraction kit), but at least have some estimates of each experiment in the objectives of your proposal, estimates of fieldwork costs, and . Also include an IDC statement in your description of Indirect Costs. (Literally copy and paste the following sentence: Foundation for SUNY at Stony Brook's most current approved rate agreement negotiated through DHHS, their federal cognizant agency, dated Feb. 26, 2015.) The Indirect Cost estimates can be calculated from the Budget Worksheet. See the PDF posted above for an example of this document.
  • Under the "Upload Personnel Attachments" tab:
    • For each person on the proposal, add the NSF Biosketch. An updated (2015) set of guidelines for a Biosketch are here. You can see an example in the PDF I posted above, but also check the most recent guidelines because things have changed.
    • For each person, add the Current & Pending forms. This includes all grants that you have, all grants you are applying for, and all grants that you are thinking of applying for. 
      Be sure to include this proposal as "Pending" and that the title is exactly the same as the title on the cover page of this proposal. Also on the C&P only - After the title put (this proposal) in parenthesis. 
  • Under the "Upload Institutional Attachments" tab:
    • Internal COI document for EACH PI and Co-PI declaring any conflicts of interest, even if there are none. The more recent of this form is available on the COEUS website.
    • COEUS proposal form: Similar to the information declared under "Investigators/Key Persons" tab. This information must also match, as well as what was under C&P. This form is available to download on the COEUS webpage.
    • Budget Spreadsheet. This is a handy spreadsheet you can download from the COEUS webpage where you enter the values of your budget per year and then it estimates the Indirect Costs. This supplements doing anything under the Budget Tab in COEUS.
Once you have all of this uploaded to your COEUS, contact your Gloria-person and tell her you are ready to submit. She will check and tell you if you need to fix anything (you probably do, so this is why you need to start way in advance). Thinks like the Current & Pending, Biosketch, Budget Justification, and Facilities also need to be uploaded to FastLane. You need to fill out your budget on FastLane as well. 

When you are getting ready to submit your DDIG (this day will come), you will have to "Allow SRO access" on your FastLane account so that the Gloria person can review and submit (no, you can't do that yourself). I hope you have come to the realization that you have so little control over your life. 

Live, learn, endure. You can do it!

Thursday, October 1, 2015

Is deduplication for you?
Re: @brianbushnell on seqanswers:Deduplication of reads is possible with dedupe, but it requires a lot of memory (~1kb per read) so it is only practical with HiSeq data if you have high-memory nodes. If you have a reference, deduplication based on mapping will use much less memory, though it will also be much slower. And whether or not deduplication is a good idea depends on the experiment; i.e., probably not for quantitative RNA-seq, and also probably not for unamplified libraries in general.

Suggested order:
1. Initial quality control BGI did this
2. Quality and/or length-based reads discarding; trimming/discarding of N-containing reads BGI did this
3. Adapter removal BGI did this
4. phix/contaminant removal
5. [optional] Error-correction --use
6. [optional] Deduplication (depending on experiment) --use
7. Merging of PE reads --use
8. Quality trimming + phix/contaminant removal --use
9. Contamination check

More advice:
As for the order of quality trimming and merge, they could be done either way. But, BBMerge internally performs quality-trimming for reads that don't merge prior to quality trimming, so for that particular program, quality-trimming after merging tends to be better.

Note that after merging, you will have 2 sets of reads (merged and unmerged) that must be processed independently, since not all of them will successfully merge

Also useful reading about merging your paired end reads before assembly:

Okay, time to try...

Step 1: Quality trimming

#quality control trimming of paired-end reads
#DR086_E.bombifrons_MOE quality control and plots -Xmx1g in1=DR086_Erbom_MOE_1.fq in2=DR086_Erbom_MOE_2.fq out1=DR086_Erbom_MOE_1_clean.fq out2=DR086_Erbom_MOE_2_clean.fq minlen=25 qtrim=rl trimq=10 qhist=qc_plots/DR086_Erbom_MOE_qhist.txt aqhist=qc_plots/DR086_Erbom_MOE_aqhist.txt lhist=qc_plots/DR086_Erbom_MOE_readLength.txt bqhist=qc_plots/DR086_Erbom_MOE_bqhist.txt overwrite=true

About 8-9% of each read was trimmed.

Normal level sequence duplication in mammals is 20 million reads
Normal sequence bias at beginning of reads due to nonrandom hybridization of random primers.

#I put this all in a shell script
sh >log.txt

I tried to do error correction but these files are HUGE!! Need to remove duplicates first.
sh in1=DR086_Erbom_MOE_1.fq in2=DR086_Erbom_MOE_2.fq out1=DR086_Erbom_MOE_1_corrected.fq out2=DR086_Erbom_MOE_2_corrected.fq
/usr/local/bin/bbmap// line 111: [: unlimited: integer expression expected
java -ea -Xmx-410m -Xms-410m -cp /usr/local/bin/bbmap/current/ jgi.KmerNormalize bits=16 ecc=t passes=1 keepall dr=f prefilter in1=DR086_Erbom_MOE_1.fq in2=DR086_Erbom_MOE_2.fq out1=DR086_Erbom_MOE_1_corrected.fq out2=DR086_Erbom_MOE_2_corrected.fq
Invalid maximum heap size: -Xmx-410m
Could not create the Java virtual machine.

#even when I increased the memory, no cigar
sh in1=DR086_Erbom_MOE_1.fq in2=DR086_Erbom_MOE_2.fq out1=DR086_Erbom_MOE_1_corrected.fq out2=DR086_Erbom_MOE_2_corrected.fq -Xmx20g
Exception in thread "Thread-60" java.lang.NoClassDefFoundError: java/util/concurrent/ThreadLocalRandom
        at jgi.KmerNormalize$

Step 2: Deduplication in=NBS1170_Desrot_MOE_1_clean.fq out=NBS1170_Desrot_MOE_1_clean_dedupe.fq am ac fo renameclusters=f storename=f pto cc qin=33 csf=stats.txt -Xmx20g overwrite=t in=NBS1170_Desrot_MOE_2_clean.fq out=NBS1170_Desrot_MOE_2_clean_dedupe.fq am ac fo renameclusters=f storename=f pto cc qin=33 csf=NBS1170_Desrot_MOE_2_clean_dedupe_stats.txt -Xmx20g overwrite=t

Step 3: Merge Paired End Reads
#merge PE in1=NBS1170_Desrot_MOE_1_clean.fq.gz in2=NBS1170_Desrot_MOE_2_clean.fq.gz out=NBS1170_Desrot_MOE_merged.fq -Xmx20g

Step 3.5: Could also remove the duplicates after merged paired ends
#remove duplicates of the two lanes in=NBS1170_Desrot_MOE_merged.fq out=NBS1170_Desrot_MOE_merged_dedupe.fq am ac renameclusters=f storename=f pto cc qin=33 csf=dedupe_logs/NBS1170_Desrot_MOE_merged_dedupe_stats.txt dot=dedupe_logs/ -Xmx20g overwrite=t