Thursday, July 31, 2014



#installing bamm
brew tap macroevolution/bamm
brew install bamm

#it worked

Wednesday, July 30, 2014

Because I did such a horrid job taking comprehensible notes while in London, I am going to be more efficient this time around with transcriptome assembly.

We have our illumina RNA-seq reads back (yay!).

To download the reads from UArizona, you have to download this horridly complicated program called irods (http://uagc.arl.arizona.edu/resources/user-guide). After much struggle, the easiest way to get things installed and compiled is the "non-fancy" way. HOWEVER, you MUST download the most recent version of iRods (as of right now 3.3.1)

Navigate to your iRODs folder. Things in blue are what you type. Things in pink are what you would change differently based on your computer.


 export PATH=$PATH:$HOME/iRODS/clients/icommands/bin
 iinit #this will prompt you for a bunch of information

Enter the host name (DNS) of the server to connect to:  mozart.arl.arizona.edu
Enter the port number:  1247
Enter your irods user name:  <your login name from UAGC/BCF>
Enter your irods zone:  BCFZone
Enter your current iRODS password:  <your password from UAGC/BCF>
 
#now your session should start
ils #lists all of the folders in your iRODS account
icd 14140709_SN885_0197_Lane1SideA_Project_Davalos_RNA #change directory
iget -r 140709_SN885_0197_Lane1SideA_Project_Davalos_RNA /Volumes/Spare/ 
#iget -r copies the folder you want to whatever directory you want

Okay now all of the data should be on your computer.

Now I need to install Trinity:
#install homebrew if you do not already have it

#now install programs necessary for trinity to run on your mac
brew install homebrew/science/bowtie
brew install git
#to allow for new installation of compiler
brew tap homebrew/dupes
brew install apple-gcc42

#okay guess I can't do that because it requires an operating system higher than Mac OSX Lion...

Yes, you read that correctly. Our operating system is still pre-lion. Seems like I will have to wait until our operating system is figured out.

Once we figure this out though, remember this link: http://sourceforge.net/p/trinityrnaseq/mailman/message/31892980/
It allows for trinity to be installed on Mac (even though it is compiled for Linux).


Tuesday, July 15, 2014

This is from awhile ago--I found it in my drafts, but I just wanted to reopen this can of worms.
selecting priors:

1) select random node on the tree:
#randomly sample a node from a randomly trimmed tree
random.node<-sample(tree.trim$edge[1,], 1)
> random.node
[1] 75

2) open up mesquite and simulate random character in Brownian motion on the tree and look at before and after rates on node 75 in Mesquite using these instructions:
https://nunn.rc.fas.harvard.edu/groups/pica/revisions/7b932/24/
https://nunn.rc.fas.harvard.edu/groups/pica/wiki/f4cc5/43_Squared_Change_Parsimony.html

Take the rate at node 75 for sig1 1 prior: -0.86245
Pick random branch within this clade for sig2 prior (Pellorneum_pyrrogenys):1.1092
Alpha: random uniform distribution runif(1)
rand.shift = 0.05 --used what Revell, et al uses
use ngen=100,000
sample:100



Friday, July 11, 2014

Visit: http://ec2-54-88-69-219.compute-1.amazonaws.com:8080/smrtportal/

Login.

To create a new job:
Click "Create New"
Click "de Novo Assembly" and "Data Prep" then click "Next"

The data for C. sowelli is already loaded as a SMRT cell. Start a new job by naming the job, selecting the protocol, and choose "C. sowelli_amp" and move it from SMRT Cells available to "SMRT Cells in Job". Click "save" and "start"at the bottom right-hand corner.

To view the failed jobs, click on the job. On the right hand side, click "View Log".

I have tried RS_HGAP Assembly 2 for quality, RS_HGAP 3 for speed, and a Long Amplicon assembly and all three have failed and it is not clear why because it seems to start out filtering with no problem. 

Wednesday, July 2, 2014

Plans for today:

[1] Find missing data for high school student
[2] Figure out how to teach RaxML to a high school student (JW slides) COMPLETE
[3] Get Gazey-Staley algorithm set up in R and running. COMPLETE
      -Okay, so we don't actually have to do this step, as its just a measure of cloning success (and since we didn't clone. We don't really need to measure this. What we do need to measure (but not right away, is the # of sequences (reads) v. # genes (unique contigs). This will tell us if we have sampled enough of the genome for the OR genes given our degenerate primers.

Anyways, to download the R code for the Gazey-Staley algorithm along with a heterogeneity test (with very good documentation), visit:
http://batlab.ucd.ie/~spuechmaille/

[4] Figure out most recent problem with babbler paper.
[5] Set up ORA pipeline COMPLETE
____________________________________________
To install ORA (after having install BioPerl, HMMER, and FASTA), download the .zip file from:
http://search.cpan.org/~ceratites/ora-1.9.1/lib/Bio/ORA.pm#DRIVER_SCRIPT

Unzip and navigate to the ORA folder.

#run the perl script to make the "makefile"

perl Makefile.PL
sudo make install
perl Build.PL

WOOOOHOOOO got it working.
There are two main issues (assuming you have everything installed).
1) You must run the perl script in the /ora-*/scripts folder because it must access the .hmm files
2) When you install ORA, it includes the older version of HMMER. The error message you get when you run "or.pl", is: 
...
binary auxfiles are in an outdated HMMER format (3/b); please hmmpress your HMM file again

To fix this, do the following (assuming you have HMMER installed).
-In the /ora-*/scripts folder, there should be several files related to HMMER
Delete the following: or.hmm.h3f     or.hmm.h3i     or.hmm.h3m     or.hmm.h3p
On the command line, type:
hmmpress or.hmm

That should fix things.

To run the ORA:
perl or.pl --sequence ~/Documents/Analyses/Carrollia_OR/GPC-trim.deg.fasta
#still playing with all of the fun parameters

Now I need to trim the data of a certain length.
By the end of the weekend, it would be great to make a tree from the sequences.


Something to still think about regarding my conservation manuscript:

Our result:
Higher fragmentation of widely distributed species compared to those that are more narrowly distributed

"I agree with this result. In a more general biogeographic sense, isn’t this expected given that a larger area has a greater probability of being fragmented than a smaller one (in the same way that a short sequence has lower probability of acquiring a mutation than a long one)? Did you scale for that somehow (have no idea how, but we can talk about it)?"

We compared proportions of distributions covered by high HII. Not sure if this counts as scaling...
New rerun of .spec file



# original asm settings
utgErrorRate = 0.02
utgErrorLimit = 2 #default is 4.5

#defaults from the manual
cnsErrorRate = 0.02
cgwErrorRate = 0.10 #this is for the scaffold so just leaving
ovlErrorRate = 0.02

merSize=14

merylMemory = 128000
merylThreads = 16

ovlStoreMemory = 8192

# grid info
useGrid = 0
scriptOnGrid = 0
frgCorrOnGrid = 0
ovlCorrOnGrid = 0

sge = -A assembly
sgeScript = -pe threads 16
sgeConsensus = -pe threads 1
sgeOverlap = -pe threads 2
sgeFragmentCorrection = -pe threads 2
sgeOverlapCorrection = -pe threads 1

#ovlMemory=8GB --hashload 0.7
ovlHashBits = 25
ovlThreads = 2
ovlHashBlockLength = 20000000
ovlRefBlockSize = 50000000

# for mer overlapper
merCompression = 1
merOverlapperSeedBatchSize = 500000
merOverlapperExtendBatchSize = 250000

frgCorrThreads = 2
frgCorrBatchSize = 100000

ovlCorrBatchSize = 100000

# non-Grid settings, if you set useGrid to 0 above these will be used
merylMemory = 128000
merylThreads = 4

ovlStoreMemory = 8192

ovlConcurrency = 8

cnsConcurrency = 8

merOverlapperThreads = 3
merOverlapperSeedConcurrency = 3
merOverlapperExtendConcurrency = 3

frgCorrConcurrency = 2
ovlCorrConcurrency = 4
cnsConcurrency = 4

Here is the summary of the PacBio reads that I am working with:

Reads: 38,703
GPC1 reads: 18,182
GPC2 reads: 20,521
Mean read length: 627

Read Quality: 0.98 (~2% error)

I am going to rerun the assembly with 2% error.

Because most of the reads are so long and ORs are technically around 900bp, Geneious could essentially align the sequences from F and R primers and there is a good chance that we will get assembled OR sequences (which we can because my adviser and I did it yesterday).

In a way this makes more sense since the goal of most Assembly programs are to assemble a genome, making a scaffold. However, we have subgenome sequences that are scattered throughout the genome. Basically, it is assembly many genes at once but doesn't need to put them in one big scaffold string.

However, I am still more comfortable with using Celera for two reasons:
1) Sequences that are too repetitive are removed. Celera is an algorithm separates the gene sequences into either unique "unitigs" to be set as seed scaffold. The repetitive or non-unique "unitigs" are saved for later to try to overlap on the seed scaffold. Because we don't really care about the scaffold, the output from Celera that we do care about is actually the degenerate unitigs, which are unique contigs (more than one read) that do not fit into a scaffold, highlighting that they are unique genes. After this, I plan to filter the degenerate unitigs by length. Then, things should be ready for the ORA pipeline.

2) You can easily control the error-correction parameters.

Tuesday, July 1, 2014

Things to do today:
[1] Fill out the CDC form for Peru COMPLETE
[2] Fix the .spec file for assembly COMPLETE
I am trying this now (.spec file):

# original asm settings
utgErrorRate = 0.05
utgErrorLimit = 4.5

#defaults from the manual
cnsErrorRate = 0.06
cgwErrorRate = 0.10
ovlErrorRate = 0.06

merSize=14

merylMemory = 128000
merylThreads = 16

ovlStoreMemory = 8192

# grid info
useGrid = 0
scriptOnGrid = 0
frgCorrOnGrid = 0
ovlCorrOnGrid = 0

sge = -A assembly
sgeScript = -pe threads 16
sgeConsensus = -pe threads 1
sgeOverlap = -pe threads 2
sgeFragmentCorrection = -pe threads 2
sgeOverlapCorrection = -pe threads 1

#ovlMemory=8GB --hashload 0.7
ovlHashBits = 25
ovlThreads = 2
ovlHashBlockLength = 20000000
ovlRefBlockSize = 50000000

# for mer overlapper
merCompression = 1
merOverlapperSeedBatchSize = 500000
merOverlapperExtendBatchSize = 250000

frgCorrThreads = 2
frgCorrBatchSize = 100000

ovlCorrBatchSize = 100000

# non-Grid settings, if you set useGrid to 0 above these will be used
merylMemory = 128000
merylThreads = 4

ovlStoreMemory = 8192

ovlConcurrency = 8

cnsConcurrency = 8

merOverlapperThreads = 3
merOverlapperSeedConcurrency = 3
merOverlapperExtendConcurrency = 3

frgCorrConcurrency = 2
ovlCorrConcurrency = 4
cnsConcurrency = 4

I took out the "stopAfter=overlapper"...because that interrupted. Not sure why we would want to do that.

[3] Write syllabus for high school student COMPLETE [7/2/2014]
[4] Look at Trpc2 data big Mac broken :[
[5] Enter grades into blackboard COMPLETE
[6] submit conservation paper COMPLETE [7/2/1/2014]
--------------------------------------------------