Wednesday, December 2, 2015

Running RELAX.


We are looking for relaxed selection in Trpc2 of several lineages of bats that have reduced morphology of vomeronasal system organs or brain regions.

To run RELAX:
RELAX is relatively recent and you have to download the most recent batch files for HyPhy.
Start HyPhy.
Open->Open Batch File...
Navigate to Applications/hyphy/res/TemplateBatchFiles/
Select RELAX.bf
It will automatically prompt you to select your data/alignment file.
Then select your tree file.
Make sure the correct number of Ts, Rs, and Us match.
It should hopefully begin, assuming you have your tree set up correctly and your alignment does not have any stop codons AND it is in frame.

Something VERY critical if you are running it in the GUI...you must select the test branches directly when it asks for it, not just when you put them in the tree. Otherwise it estimates the reference branches as the test branches and vice versa.

RELAX1:
Alignment: bat_data_match.nex
Tree: bat_tree_match_RELAX1.tre
(((((((((((((((((((((Artibeus_jamaicensis{R}:2.431373248,Artibeus_obscurus{R}:2.431373248){R}:0.3588412223,Artibeus_fimbriatus{R}:2.79021447){R}:0.9561230126,(Artibeus_intermedius{R}:0.6313580609,Artibeus_lituratus{R}:0.6313580609){R}:3.114979422){R}:1.424889238,Artibeus_inopinatus{R}:5.171226721){R}:0.9218212476,Artibeus_concolor{R}:6.093047969){R}:2.392412914,((((Artibeus_watsoni{R}:2.760434077,Artibeus_incomitatus{R}:2.760434077){R}:1.940812275,Artibeus_phaeotis{R}:4.701246352){R}:0.9218549563,Artibeus_cinereus{R}:5.623101308){R}:0.8637586984,(Artibeus_glaucus{R}:3.316302056,Artibeus_gnomus{R}:3.316302056){R}:3.170557951){R}:1.998600876){R}:2.413693209,(((Ariteus_flavescens{R}:2.487096883,Ardops_nichollsi{R}:2.487096883){R}:2.828400699,Phyllops_falcatus{R}:5.315497583){R}:0.8401618707,((Pygoderma_bilabiatum{R}:3.602361855,Sphaeronycteris_toxophyllum{R}:3.602361855){R}:1.181679598,Centurio_senex{R}:4.784041453){R}:1.371618001){R}:4.743494639){R}:1.13714402,Ectophylla_alba{R}:12.03629811){R}:0.9420299332,Enchisthenes_hartii{R}:12.97832805){R}:1.825921019,((((((((Platyrrhinus_nigellus{R}:2.504001415,Platyrrhinus_dorsalis{R}:2.504001415){R}:0.4677903585,(Platyrrhinus_infuscus{R}:2.110403375,Platyrrhinus_aurarius{R}:2.110403375){R}:0.8613883988){R}:1.095202405,Platyrrhinus_albericoi{R}:4.066994179){R}:2.177118584,Platyrrhinus_lineatus{R}:6.244112763){R}:1.779671702,Vampyrodes_caraccioli{R}:8.023784464){R}:2.695962001,(Vampyressa_thyone{R}:7.747059999,Mesophylla_macconnelli{R}:7.747059999){R}:2.972686466){R}:0.777302653,(((Chiroderma_villosum{R}:1.968379874,Chiroderma_improvisum{R}:1.968379874){R}:1.432916649,Chiroderma_doriae{R}:3.401296523){R}:5.790148924,(Vampyressa_brocki{R}:7.444560681,Vampyressa_bidens{R}:7.444560681){R}:1.746884766){R}:2.305603671){R}:0.737544275,Uroderma_magnirostrum{R}:12.23459339){R}:2.569655671){R}:2.978950491,(((Sturnira_bogotensis{R}:4.351270017,Sturnira_tildae{R}:4.351270017){R}:0.6450817233,Sturnira_magna{R}:4.99635174){R}:0.4749192885,(Sturnira_oporaphilum{R}:0.9163878452,Sturnira_lilium{R}:0.9163878452){R}:4.554883183){R}:12.31192853){R}:1.896666128,Rhinophylla_fischerae{R}:19.67986568){R}:1.622496147,(((((Carollia_perspicillata{R}:2.166884003,Carollia_brevicauda{R}:2.166884003){R}:0.7022808564,Carollia_sowelli{R}:2.869164859){R}:1.049593577,Carollia_subrufa{R}:3.918758436){R}:4.394363478,Carollia_castanea{R}:8.313121914){R}:11.76400773,(Glyphonycteris_sylvestris{R}:15.43079994,Trinycteris_nicefori{R}:15.43079994){R}:4.6463297){R}:1.225232188){R}:1.57140971,((((Lonchophylla_hesperia{R}:9.371624678,Lionycteris_spurrelli{R}:9.371624678){R}:2.706833404,Platalina_genovensium{R}:12.07845808){R}:1.204240126,Lonchophylla_chocoana{R}:13.28269821){R}:2.390290187,Lonchophylla_mordax{R}:15.6729884){R}:7.200783145){R}:1.155625037,((((((Glossophaga_longirostris{R}:5.673316434,Glossophaga_commissarisi{R}:5.673316434){R}:1.473493579,Glossophaga_soricina{R}:7.146810013){R}:4.41746123,(Leptonycteris_nivalis{R}:1.825719598,Leptonycteris_curasoae{R}:1.825719598){R}:9.738551645){R}:6.408254338,(Brachyphylla_pumila{T}:1.377027909,Brachyphylla_cavernarum{T}:1.377027909){T}:16.59549767){R}:1.44044124,(((Musonycteris_harrisoni{T}:1.250127196,Choeronycteris_mexicana{T}:1.250127196){T}:4.862545039,Choeroniscus_godmani{T}:6.112672234){T}:10.9120772,((Anoura_latidens{R}:3.779427833,Anoura_geoffroyi{R}:3.779427833){R}:2.83493221,Anoura_caudifer{R}:6.614360043){R}:10.41038939){R}:2.388217391){R}:3.771706573,(Lonchorhina_aurita{R}:8.795556441,Lonchorhina_inusitata{R}:8.795556441){R}:14.38911695){R}:0.8447231842){R}:0.9614243908,(((((Lophostoma_evotis{R}:2.279691562,Lophostoma_silvicolaFG{R}:2.279691562){R}:7.690705428,Lophostoma_brasiliense{R}:9.97039699){R}:7.599298001,(Phyllostomus_elongatus{R}:8.119582824,Phyllostomus_discolor{R}:8.119582824){R}:9.450112167){R}:1.521869862,(Tonatia_saurophila{R}:4.502755885,Tonatia_bidens{R}:4.502755885){R}:14.58880897){R}:3.674523748,(Chrotopterus_auritus{R}:19.15333438,Mimon_cozumelae{R}:19.15333438){R}:3.61275422){R}:2.224732367){R}:2.433669076,((Diaemus_youngi{R}:12.39244932,Desmodus_rotundus{R}:12.39244932){R}:7.835663972,Diphylla_ecaudata{R}:20.22811329){R}:7.196376756){R}:0.8214752215,(((Micronycteris_microtis{R}:9.018674346,Micronycteris_hirsuta{R}:9.018674346){R}:5.324423535,(Micronycteris_minuta{R}:4.882043938,Micronycteris_schmidtorum{R}:4.882043938){R}:9.461053943){R}:8.513447254,Lampronycteris_brachyotis{R}:22.85654514){R}:5.389420131){R}:7.944937657,((((Pteronotus_quadridens{T}:6.670886247,Pteronotus_macleayii{T}:6.670886247){T}:4.04633203,Pteronotus_davyi{T}:10.71721828){T}:3.138003269,Pteronotus_parnellii{T}:13.85522155){T}:18.91999925,Mormoops_megalophylla{T}:32.7752208){T}:3.415682128){R}:5.869690062,Thyroptera_lavali{T}:42.06059299){R}:13.44742223,((((((((Miniopterus_aelleni{R}:1.299615,Miniopterus_petersoni{R}:1.299615){R}:4.270025,Miniopterus_gleni{R}:5.569641){R}:2.003838,Miniopterus_sororculus{R}:7.573479){R}:2.031566,Miniopterus_minor{R}:9.605045){R}:14.889341,(Miniopterus_fuliginosus{R}:17.236618,Miniopterus_schreibersii{R}:17.236618){R}:7.257769){R}:27.640976,Myotis_elegans{T}:52.13536){R}:1.662856,Molossus_sinaloae{T}:53.798217){R}:0.976298,Chilonatalus_micropus{T}:54.774513){R}:0.873379);


Load this file into HyPhy-vision
Results from the first run (Important results are in the .json file):

Test for selection relaxation (K = 0.43) was significant (p = 0.0000, LR = 46.62)

  • Modellog L# par.AICcTime to fitLtreeBranch setω1ω2ω3
    Partitioned MG94xREV-4116.852188675.524 min. 44 sec.1.33Reference0.102 (100%)
    Test0.367 (100%)
    General Descriptive-4024.514228915.0415 min. 55 sec.3.99All0.156 (100%)0.601 (0.0%)10.7 (0.0%)
    Null-4127.442218702.867 min. 9 sec.4.05Reference0.139 (99%)0.141 (0.48%)53.5 (0.020%)
    Test0.139 (99%)0.141 (0.48%)53.5 (0.020%)
    Alternative-4104.132228658.3034 min. 33 sec.3.99Reference0.0563 (5.6%)0.0763 (94%)3.83 (0.57%)
    Test0.290 (5.6%)0.331 (94%)1.78 (0.57%)
    Partitioned Exploratory-4104.052268666.3612 min. 26 sec.3.99Reference0.0231 (5.9%)0.0743 (93%)3.16 (0.84%)
    Test0.290 (0.0%)0.334 (100%)1.42 (0.0%)









Model (click to enlarge): Alternative (A), General Descriptive (B)


ω distributions under the Alternative model

Test branches are shown in blue and reference branches are shown in red

ω distributions under the Null model

Test branches are shown in blue and reference branches are shown in red

ω distributions under the Partitioned Exploratory model

Test branches are shown in blue and reference branches are shown in red

ω distributions under the Partitioned MG94xREV model

Test branches are shown in blue and reference branches are shown in red


The x-axis of the graphs comes out horribly, but other than that, I'm pleased with the visualization package.

Monday, November 23, 2015

Find the best model of evolution for Trpc2 using modelomatic:
#For this analysis, all stop codons need to be removed.
#For troubleshooting modelOmatic, I wrote a more detailed post here.
#Navigate to the correct directory and run this command:
modelomatic_OSX Trpc2_alignment_noStops.phy bionj Trpc2_modelomatic_normal.txt 0 normal

Interestingly REV+4dG is the best model, despite it being a protein coding gene. This is essentially "GTR+G". The 4dG is showing the four rate categories (this is an arbitrary number) for the gamma parameter.

Find the "best" (ML) tree in GALI using search replicates:
#run this command
garli Trpc2_garli.conf
#Note how to implement GTR+G in GARLI. See here for more examples of other models.
#Here is my configuration file (Trpc2_garli.conf):
[general]
datafname = Trpc2_garli.nex
constraintfile = none
streefname = stepwise
attachmentspertaxon = 50
ofprefix = Trpc2.GTR4DG
randseed = -1
availablememory = 512
logevery = 10
saveevery = 100
refinestart = 1
outputeachbettertopology = 0
outputcurrentbesttopology = 0
enforcetermconditions = 1
genthreshfortopoterm = 20000
scorethreshforterm = 0.05
significanttopochange = 0.01
outputphyliptree = 0
outputmostlyuselessfiles = 0
writecheckpoints = 0
restart = 0
outgroup = 1
resampleproportion = 1.0
inferinternalstateprobs = 0
outputsitelikelihoods = 0
optimizeinputonly = 0
collapsebranches = 1

searchreps = 8
bootstrapreps = 0

[model1]
datatype = nucleotide
ratematrix = 6rate
statefrequencies = estimate
ratehetmodel = gamma
numratecats = 4
invariantsites = none

[master]
nindivs = 4
holdover = 1
selectionintensity = 0.5
holdoverpenalty = 0
stopgen = 5000000
stoptime = 5000000

startoptprec = 0.5
minoptprec = 0.01
numberofprecreductions = 10
treerejectionthreshold = 50.0
topoweight = 1.0
modweight = 0.05
brlenweight = 0.2
randnniweight = 0.1
randsprweight = 0.3
limsprweight =  0.6
intervallength = 100
intervalstostore = 5
limsprrange = 6
meanbrlenmuts = 5
gammashapebrlen = 1000
gammashapemodel = 1000
uniqueswapbias = 0.1
distanceswapbias = 1.0

Putting bootstrap values on GARLI:
You can bootstrap in GARLI, but to "put" the bootstraps on the nodes, you have to summarize the bootstrap results using an outside method. SumTrees.py in the Dendropy python package works well. I basically followed the exact instructions somebody wrote here.

Running bootstrap iterations in GARLI:
#I ran four independent runs simultaneously of 250 bootstrap iterations that will combine to = 1000.
#Here is my GARLI configuration file:
[general]
datafname = Trpc2_garli.nex
constraintfile = none
streefname = stepwise
attachmentspertaxon = 50
ofprefix = Trpc2.GTR4DG.boot
randseed = -1
availablememory = 512
logevery = 10
saveevery = 100
refinestart = 1
outputeachbettertopology = 0
outputcurrentbesttopology = 0
enforcetermconditions = 1
genthreshfortopoterm = 20000
scorethreshforterm = 0.05
significanttopochange = 0.01
outputphyliptree = 0
outputmostlyuselessfiles = 0
writecheckpoints = 0
restart = 0
outgroup = 1
resampleproportion = 1.0
inferinternalstateprobs = 0
outputsitelikelihoods = 0
optimizeinputonly = 0
collapsebranches = 1

searchreps = 1
bootstrapreps = 250

[model1]
datatype = nucleotide
ratematrix = 6rate
statefrequencies = estimate
ratehetmodel = gamma
numratecats = 4
invariantsites = none

[master]
nindivs = 4
holdover = 1
selectionintensity = 0.5
holdoverpenalty = 0
stopgen = 5000000
stoptime = 5000000

startoptprec = 0.5
minoptprec = 0.01
numberofprecreductions = 10
treerejectionthreshold = 20.0
topoweight = 1.0
modweight = 0.05
brlenweight = 0.2
randnniweight = 0.1
randsprweight = 0.3
limsprweight =  0.6
intervallength = 100
intervalstostore = 5
limsprrange = 6
meanbrlenmuts = 5
gammashapebrlen = 1000
gammashapemodel = 1000
uniqueswapbias = 0.1
distanceswapbias = 1.0

Installing dendropy
#We have "pip" set up on our server, making the python package easy to install:
sudo pip install -U dendropy

Running SumTrees
#place all of your bootstrap outputs and your best tree from GARLI in its own folder

cd /Volumes/Yango/Trpc2/garli/

mkdir sumtrees
cp Trpc2.GTR4DG.best.tre sumtrees/
cp Trpc2.GTR4DG.boot.boot.tre sumtrees/
cp Trpc2.GTR4DG.boot2.boot.tre sumtrees/
cp Trpc2.GTR4DG.boot3.boot.tre sumtrees/
cp Trpc2.GTR4DG.boot4.boot.tre sumtrees/

#from within the folder, run sumtrees.py
#your target tree is the "best" tree from your GARLI output

sumtrees.py --decimals=0 --percentages --no-analysis-metainformation --no-taxa-block Trpc2.GTR4DG.boot.boot.tre Trpc2.GTR4DG.boot2.boot.tre Trpc2.GTR4DG.boot3.boot.tre Trpc2.GTR4DG.boot4.boot.tre --target=Trpc2.GTR4DG.best.tre --output=supportOnBest.phy

Viewing tree with bootstraps
Now, open supportOnBest.phy in FigTree. Click on "node labels" and select "support" as the node label. These are your bootstrap values.

Friday, October 16, 2015

Installing pbh5tools on OSX yosemite

I need to assemble PacBio amplicons of olfactory receptors. One of the main challenges I have been having is that most of the PacBio tools (SMRT-analysis, Celara assembler) focus on assembling whole genomes. Using an approach from Larson, et al. (2014) in BMC Genomics, I am going to do quality filtering using pbh5tools that will filter the reads at a minimum of passes (e.g. 4; though my mean number of passes is quite high for the data (~20)).

To install pbh5tools, I dealt with a series of challenges on Yosemite:
Download from github.
cd pbh5tools-master/
sudo python setup.py install

I ran into this pesky error.
In file included from /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/core/include/numpy/ndarraytypes.h:1760:
/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/core/include/numpy/npy_1_7_deprecated_api.h:15:2: warning: 
      "Using deprecated NumPy API, disable it by "          "#defining
      NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-W#warnings]
#warning "Using deprecated NumPy API, disable it by " \
 ^
In file included from /tmp/easy_install-_W_eCV/h5py-2.5.0/h5py/defs.c:279:
/tmp/easy_install-_W_eCV/h5py-2.5.0/h5py/api_compat.h:27:10: fatal error: 
      'hdf5.h' file not found
#include "hdf5.h"
         ^
1 warning and 1 error generated.

error: Setup script exited with error: command 'cc' failed with exit status 1
First I tries to install hdf5 tool packages and then export the path there, but that wasn't working.

From the setup.py, I see that it needs:
    install_requires=[
        'pbcore >= 0.8.0',
        'numpy >= 1.6.0',
        'h5py >= 1.3.0'
        ]

sudo pip install pbcore
sudo pip install numpy
sudo pip install h5py

Then,
sudo python setup.py install
...
Using /Library/Python/2.7/site-packages
Searching for pysam==0.8.3
Best match: pysam 0.8.3
pysam 0.8.3 is already the active version in easy-install.pth

Using /Library/Python/2.7/site-packages
Finished processing dependencies for pbh5tools==0.8.0
Voila!

Sunday, October 11, 2015

filing COEUS with DDIG at Stony Brook

Applying for an NSF DDIG for the first time is an experience full of lots of emotions. I tried to put together some advice to prevent a few less tears being shed.

One does not just simply "submit" a DDIG. It is a several week-long process full of joy and laughter, and the only way to prevent a stress-induced stomach ulcer is to sprinkle the experience with sarcastic remarks. Here is a short guide for getting yourself set-up in FASTLANE and submitting to COEUS (Stony Brook's version of the grant-submitting interface that mirrors online forms of the 1990s).

A true and serious piece of advice is to begin this two weeks before the deadline (at least). Believe me, procrastination will do you no favors. Do this before you have a draft together. The minimum you need to get started is the title of your project (which you can always change later). But you need to get started. Also, make sure you have your IACUC and IRB approval taken care of well in advance, because you will need this for COEUS as well.

[1] First, you must apply for an NSF FastLane login name. This is not something you sign up for on the NSF website, you have absolutely no control over this. Someone in your university will grant you this ID. This is different than the GFRP login if you are a GFRP fellow. Forget the beautifully designed FastLane submission form of the GFRP, where the sea-green trim borders all of the boxes and you actually may feel happy about life. Those days are in your past and you must now submit yourself to the harsh reality that the federal government does not have a budget for maintaining web submission forms post the early 2000s, and never will. To request your NSF ID (summarized in better detail here), you must e-mail your Department Grant Officer. For Ecology & Evolution, it is (at the time of writing) Gloria (and she is truly great!).

[2] Log in to FastLane. Go to https://www.fastlane.nsf.gov, click "Proposals, Awards and Status". Using your new login information, login. Click "Proposal Functions", "Proposal Preparation", and "Prepare Proposal". Also, update any of your FastLane information. For a DDIG, your adviser (the PI) must do this first. Sit with them (or do it for them), with your NSF ID, and create the new proposal.


[3] Log in to COEUS and begin a new proposal. Your adviser may have to do this first and then add you. You first must at least fill out the first page of creating a new proposal. Either do this for your adviser (before you do it for you) or sit with your adviser. Under the tab "Investigators/Key Persons", they can now add you as a Co-PI. As an added Co-PI, you need to fill out a series of questions to certify.
sunshine and happiness
An example of my login page
Things to pay attention to-- the start date for the NSF DEB begins at the earliest 05/01 of the next year following submission and can last for 24 months. Note that the end date could not be 05/01/2018, but must be 04/30/2018. The title must begin with "DISSERTATION RESERCH:......".

[4] Add the Grants Management Officer (in this case, Gloria) as an aggregator. Before you yourself can do anything, the PI (your adviser) must give you permission to add and edit the proposal. On the left, click on "Proposal Roles". Add yourself and Gloria as an aggregator. 
You add the permissions for the people under Aggregator, the Grants Officer takes care of the rest
[5] Fill out "Credit Split" tab. I really had no clue what was going on in this tab. Basically, give your adviser all of the credit and make sure the numbers add up to 100. 

[6] Under the "Special Review" tab, upload any IACUC of IRB approvals associated with your project (there are certain numbers to report associated with your project. 

[7] Update the "Investigators/Key Persons" tab to reflect the amount of "Effort" you will be putting into the project. Even though the DDIG has no salary associated with it, this still must be reported Remember that the university cares about nothing except money and it likes to know where it should be distributed over what frame of time, even if its not even there. 

Notice that the C is referring to Calendar Year %Effort, not months. 1 month Calendar Year is 8.33% (1 / 12mo); 1 month AY 11.11% (1 / 9mo); 1 month summer 33.33% (1 / 3mo). My advisers effort is 1 month out of 12 calendar months. Mine is 11 out of 12 calendar months. This is where you can mess up because this effort MUST match what you also put in your CURRENT & PENDING documents. You can see my approval is still in progress, as an effect of how I did manage to mess this up. :-|

[8]Answer all the random questions in the tabs under "Questionnaire". I do recall these being relatively painless. 

[9] Prepare the documents you need to submit to COEUS. Don't do anything in the budget tab. Instead, fill out the "Budget Worksheet", an excel file where you add everything and then upload it under "Upload Documents". There should be three tabs, each with different forms you need to update. These forms change all of the time so make sure from the COEUS forms tab, you are downloading the right ones. Some documents you can generate the template yourself.

An example of a successful DDIG with all of the parts can be found here. Feel free to contact me if you would like more Stony Brook-related examples of some of these documents. 
  • Under the tab "Upload Proposal Documents"
    • COEUS project summary: This can be the DDIG project summary format (Overview, Intellectual Merit, Broader Impacts) or if this is not ready yet, you can submit a random paragraph from your project that summarizes the broad objectives. 
    • DDIG Facilities statement: Basically, you summarize anything equipment, computing facilities, or laboratories you will be in your project. Make a statement about the Stony Brook Libraries too. See the example above for a model of (in my opinion, bare minimum) facilities page. I will be happy to provide an example of thorough one.
    • DDIG Budget Justification: A one-two page summary that includes a timeline of how the money will be used in the proposal. Don't go super crazy (e.g. not necessary to include the exact quote for a Qiagen extraction kit), but at least have some estimates of each experiment in the objectives of your proposal, estimates of fieldwork costs, and . Also include an IDC statement in your description of Indirect Costs. (Literally copy and paste the following sentence: Foundation for SUNY at Stony Brook's most current approved rate agreement negotiated through DHHS, their federal cognizant agency, dated Feb. 26, 2015.) The Indirect Cost estimates can be calculated from the Budget Worksheet. See the PDF posted above for an example of this document.
  • Under the "Upload Personnel Attachments" tab:
    • For each person on the proposal, add the NSF Biosketch. An updated (2015) set of guidelines for a Biosketch are here. You can see an example in the PDF I posted above, but also check the most recent guidelines because things have changed.
    • For each person, add the Current & Pending forms. This includes all grants that you have, all grants you are applying for, and all grants that you are thinking of applying for. 
      Be sure to include this proposal as "Pending" and that the title is exactly the same as the title on the cover page of this proposal. Also on the C&P only - After the title put (this proposal) in parenthesis. 
  • Under the "Upload Institutional Attachments" tab:
    • Internal COI document for EACH PI and Co-PI declaring any conflicts of interest, even if there are none. The more recent of this form is available on the COEUS website.
    • COEUS proposal form: Similar to the information declared under "Investigators/Key Persons" tab. This information must also match, as well as what was under C&P. This form is available to download on the COEUS webpage.
    • Budget Spreadsheet. This is a handy spreadsheet you can download from the COEUS webpage where you enter the values of your budget per year and then it estimates the Indirect Costs. This supplements doing anything under the Budget Tab in COEUS.
Once you have all of this uploaded to your COEUS, contact your Gloria-person and tell her you are ready to submit. She will check and tell you if you need to fix anything (you probably do, so this is why you need to start way in advance). Thinks like the Current & Pending, Biosketch, Budget Justification, and Facilities also need to be uploaded to FastLane. You need to fill out your budget on FastLane as well. 

When you are getting ready to submit your DDIG (this day will come), you will have to "Allow SRO access" on your FastLane account so that the Gloria person can review and submit (no, you can't do that yourself). I hope you have come to the realization that you have so little control over your life. 

Live, learn, endure. You can do it!

Thursday, October 1, 2015

Is deduplication for you?
Re: @brianbushnell on seqanswers:Deduplication of reads is possible with dedupe, but it requires a lot of memory (~1kb per read) so it is only practical with HiSeq data if you have high-memory nodes. If you have a reference, deduplication based on mapping will use much less memory, though it will also be much slower. And whether or not deduplication is a good idea depends on the experiment; i.e., probably not for quantitative RNA-seq, and also probably not for unamplified libraries in general.

Suggested order:
1. Initial quality control BGI did this
2. Quality and/or length-based reads discarding; trimming/discarding of N-containing reads BGI did this
3. Adapter removal BGI did this
4. phix/contaminant removal
5. [optional] Error-correction --use ecc.sh
6. [optional] Deduplication (depending on experiment) --use dedupe.sh
7. Merging of PE reads --use bbmerge.sh
8. Quality trimming + phix/contaminant removal --use bbduk.sh
9. Contamination check

More advice:
As for the order of quality trimming and merge, they could be done either way. But, BBMerge internally performs quality-trimming for reads that don't merge prior to quality trimming, so for that particular program, quality-trimming after merging tends to be better.

Note that after merging, you will have 2 sets of reads (merged and unmerged) that must be processed independently, since not all of them will successfully merge

Also useful reading about merging your paired end reads before assembly:
http://thegenomefactory.blogspot.com/2012/11/tools-to-merge-overlapping-paired-end.html

Okay, time to try...


Step 1: Quality trimming

#quality control trimming of paired-end reads
#DR086_E.bombifrons_MOE quality control and plots
bbduk.sh -Xmx1g in1=DR086_Erbom_MOE_1.fq in2=DR086_Erbom_MOE_2.fq out1=DR086_Erbom_MOE_1_clean.fq out2=DR086_Erbom_MOE_2_clean.fq minlen=25 qtrim=rl trimq=10 qhist=qc_plots/DR086_Erbom_MOE_qhist.txt aqhist=qc_plots/DR086_Erbom_MOE_aqhist.txt lhist=qc_plots/DR086_Erbom_MOE_readLength.txt bqhist=qc_plots/DR086_Erbom_MOE_bqhist.txt overwrite=true

About 8-9% of each read was trimmed.

Normal level sequence duplication in mammals is 20 million reads
Normal sequence bias at beginning of reads due to nonrandom hybridization of random primers.

#I put this all in a shell script
sh BGI_qc.sh >log.txt

I tried to do error correction but these files are HUGE!! Need to remove duplicates first.
sh ecc.sh in1=DR086_Erbom_MOE_1.fq in2=DR086_Erbom_MOE_2.fq out1=DR086_Erbom_MOE_1_corrected.fq out2=DR086_Erbom_MOE_2_corrected.fq
/usr/local/bin/bbmap//calcmem.sh: line 111: [: unlimited: integer expression expected
java -ea -Xmx-410m -Xms-410m -cp /usr/local/bin/bbmap/current/ jgi.KmerNormalize bits=16 ecc=t passes=1 keepall dr=f prefilter in1=DR086_Erbom_MOE_1.fq in2=DR086_Erbom_MOE_2.fq out1=DR086_Erbom_MOE_1_corrected.fq out2=DR086_Erbom_MOE_2_corrected.fq
Invalid maximum heap size: -Xmx-410m
Could not create the Java virtual machine.

#even when I increased the memory, no cigar
sh ecc.sh in1=DR086_Erbom_MOE_1.fq in2=DR086_Erbom_MOE_2.fq out1=DR086_Erbom_MOE_1_corrected.fq out2=DR086_Erbom_MOE_2_corrected.fq -Xmx20g
Exception in thread "Thread-60" java.lang.NoClassDefFoundError: java/util/concurrent/ThreadLocalRandom
        at jgi.KmerNormalize$ProcessThread.run(KmerNormalize.java:2807)

Step 2: Deduplication
dedupe.sh in=NBS1170_Desrot_MOE_1_clean.fq out=NBS1170_Desrot_MOE_1_clean_dedupe.fq am ac fo renameclusters=f storename=f pto cc qin=33 csf=stats.txt dot=graph.dot -Xmx20g overwrite=t
dedupe.sh in=NBS1170_Desrot_MOE_2_clean.fq out=NBS1170_Desrot_MOE_2_clean_dedupe.fq am ac fo renameclusters=f storename=f pto cc qin=33 csf=NBS1170_Desrot_MOE_2_clean_dedupe_stats.txt dot=NBS1170_Desrot_MOE_2_clean_dedupe_graph.dot -Xmx20g overwrite=t

Step 3: Merge Paired End Reads
#merge PE 
bbmerge.sh in1=NBS1170_Desrot_MOE_1_clean.fq.gz in2=NBS1170_Desrot_MOE_2_clean.fq.gz out=NBS1170_Desrot_MOE_merged.fq -Xmx20g

Step 3.5: Could also remove the duplicates after merged paired ends
#remove duplicates of the two lanes

dedupe.sh in=NBS1170_Desrot_MOE_merged.fq out=NBS1170_Desrot_MOE_merged_dedupe.fq am ac renameclusters=f storename=f pto cc qin=33 csf=dedupe_logs/NBS1170_Desrot_MOE_merged_dedupe_stats.txt dot=dedupe_logs/NBS1170_Desrot_MOE_merged_dedupe_stats_graph.dot -Xmx20g overwrite=t

Monday, August 31, 2015

I am going to make a gene tree of all the Trpc2 sequences to demonstrate how different the pseudogenes are from the conserved gene--the hypothesis being that the pseudogenes will have much longer branch lengths.

Running modelo
modelomatic_OSX Trpc2_modelomatic.phy bionj Trpc2_modelomatic_normal.txt 0 normal
---------------------------------------------------------------------------------------------
  ModelOMatic (v1.01 (release)).
A program for choosing substitutions models for phylogenetic inference.
Written by Simon Whelan.
Contributions from James Allen, Ben Blackburne and David Talavera.
---------------------------------------------------------------------------------------------
Data: <Trpc2_modelomatic.phy>: 118 sequences of length 489 (DataMatrix: 118 x 380)
Checking for sparse (>85% gaps) sequences
Starting with 118 ... after removal there are 118 sequences
Creating start tree ... 
Assertion failed: (0), function Error, file tools.cxx, line 235.
Unrecognised codon: TAG for data codon position 153 in sequence 52...Abort trap: 6
105-238:modelomatic loloyohe$ modelomatic_OSX Trpc2_modelomatic.phy bionj Trpc2_modelomatic_normal.txt 0 normal

---------------------------------------------------------------------------------------------
  ModelOMatic (v1.01 (release)).
A program for choosing substitutions models for phylogenetic inference.
Written by Simon Whelan.
Contributions from James Allen, Ben Blackburne and David Talavera.
---------------------------------------------------------------------------------------------
Data: <Trpc2_modelomatic.phy>: 118 sequences of length 489 (DataMatrix: 118 x 380)
Checking for sparse (>85% gaps) sequences
Starting with 118 ... after removal there are 118 sequences
Creating start tree ...  estimated using bionj (0.813652s)
Optimisation settings: normal 
Scanning for modelomatic.ini file ...
Amino acid models skipping rtREV, HIVb, HIVw
Codon models skipping F0, F1X4, F3X4, F64 done
Working with genetic code: Universal
>>> Doing model analysis <<< 
RY Done  (2.84812s)
NT Done  (65.4292s)
AA Done  (267.533s)...
Codon Done  (0.255969s)

Outputting results to <Trpc2_modelomatic_normal.txt>

Successful exit

For once, something worked. The model with the best fit is the K2P+4dG. This is the Kimura two-parameter + gamma with four site rates. A good summary of the nucleotide models can be found here. K2P means that the transitions (alpha) and transversions (beta) will have different rates, but that all nucleotides occur at the same frequency. This model can easily be implemented in GARLI. The K2P model is just a simpler HKY model--HKY allows for different transition and transversion rates, but also allows 4 parameters for nucleotide substitutions. I am using the GARLI on CIPRES. 

#everything is default except
Maximum hours to run=48
-searchreps=8
-bootstrapreps=0
-datatype=nucleotide
#to make it K2P make sure to change base statefrequencies
-ratematrix=HKY(2rate)
-statefrequencies=fixed
-invariant sites=none
#this is where the 4dG comes in from modelomatic
-ratehetmodel=gamma
-numratecats=4



ERROR: state frequencies specified as fixed, but no

        Garli block found in Trpc2_garli.phy!!

Oops, it choked. -statefrequencies should be set to equal:

#everything is default except
Maximum hours to run=48
-searchreps=8
-bootstrapreps=0
-datatype=nucleotide
#to make it K2P make sure to change base statefrequencies
-ratematrix=HKY(2rate)
-statefrequencies=equal
-invariant sites=none
#this is where the 4dG comes in from modelomatic
-ratehetmodel=gamma

-numratecats=4
From the best tree from GARLI:

Now onto testing if some branch lengths are significantly longer...