Showing posts with label garli. Show all posts
Showing posts with label garli. Show all posts

Monday, November 23, 2015

Find the best model of evolution for Trpc2 using modelomatic:
#For this analysis, all stop codons need to be removed.
#For troubleshooting modelOmatic, I wrote a more detailed post here.
#Navigate to the correct directory and run this command:
modelomatic_OSX Trpc2_alignment_noStops.phy bionj Trpc2_modelomatic_normal.txt 0 normal

Interestingly REV+4dG is the best model, despite it being a protein coding gene. This is essentially "GTR+G". The 4dG is showing the four rate categories (this is an arbitrary number) for the gamma parameter.

Find the "best" (ML) tree in GALI using search replicates:
#run this command
garli Trpc2_garli.conf
#Note how to implement GTR+G in GARLI. See here for more examples of other models.
#Here is my configuration file (Trpc2_garli.conf):
[general]
datafname = Trpc2_garli.nex
constraintfile = none
streefname = stepwise
attachmentspertaxon = 50
ofprefix = Trpc2.GTR4DG
randseed = -1
availablememory = 512
logevery = 10
saveevery = 100
refinestart = 1
outputeachbettertopology = 0
outputcurrentbesttopology = 0
enforcetermconditions = 1
genthreshfortopoterm = 20000
scorethreshforterm = 0.05
significanttopochange = 0.01
outputphyliptree = 0
outputmostlyuselessfiles = 0
writecheckpoints = 0
restart = 0
outgroup = 1
resampleproportion = 1.0
inferinternalstateprobs = 0
outputsitelikelihoods = 0
optimizeinputonly = 0
collapsebranches = 1

searchreps = 8
bootstrapreps = 0

[model1]
datatype = nucleotide
ratematrix = 6rate
statefrequencies = estimate
ratehetmodel = gamma
numratecats = 4
invariantsites = none

[master]
nindivs = 4
holdover = 1
selectionintensity = 0.5
holdoverpenalty = 0
stopgen = 5000000
stoptime = 5000000

startoptprec = 0.5
minoptprec = 0.01
numberofprecreductions = 10
treerejectionthreshold = 50.0
topoweight = 1.0
modweight = 0.05
brlenweight = 0.2
randnniweight = 0.1
randsprweight = 0.3
limsprweight =  0.6
intervallength = 100
intervalstostore = 5
limsprrange = 6
meanbrlenmuts = 5
gammashapebrlen = 1000
gammashapemodel = 1000
uniqueswapbias = 0.1
distanceswapbias = 1.0

Putting bootstrap values on GARLI:
You can bootstrap in GARLI, but to "put" the bootstraps on the nodes, you have to summarize the bootstrap results using an outside method. SumTrees.py in the Dendropy python package works well. I basically followed the exact instructions somebody wrote here.

Running bootstrap iterations in GARLI:
#I ran four independent runs simultaneously of 250 bootstrap iterations that will combine to = 1000.
#Here is my GARLI configuration file:
[general]
datafname = Trpc2_garli.nex
constraintfile = none
streefname = stepwise
attachmentspertaxon = 50
ofprefix = Trpc2.GTR4DG.boot
randseed = -1
availablememory = 512
logevery = 10
saveevery = 100
refinestart = 1
outputeachbettertopology = 0
outputcurrentbesttopology = 0
enforcetermconditions = 1
genthreshfortopoterm = 20000
scorethreshforterm = 0.05
significanttopochange = 0.01
outputphyliptree = 0
outputmostlyuselessfiles = 0
writecheckpoints = 0
restart = 0
outgroup = 1
resampleproportion = 1.0
inferinternalstateprobs = 0
outputsitelikelihoods = 0
optimizeinputonly = 0
collapsebranches = 1

searchreps = 1
bootstrapreps = 250

[model1]
datatype = nucleotide
ratematrix = 6rate
statefrequencies = estimate
ratehetmodel = gamma
numratecats = 4
invariantsites = none

[master]
nindivs = 4
holdover = 1
selectionintensity = 0.5
holdoverpenalty = 0
stopgen = 5000000
stoptime = 5000000

startoptprec = 0.5
minoptprec = 0.01
numberofprecreductions = 10
treerejectionthreshold = 20.0
topoweight = 1.0
modweight = 0.05
brlenweight = 0.2
randnniweight = 0.1
randsprweight = 0.3
limsprweight =  0.6
intervallength = 100
intervalstostore = 5
limsprrange = 6
meanbrlenmuts = 5
gammashapebrlen = 1000
gammashapemodel = 1000
uniqueswapbias = 0.1
distanceswapbias = 1.0

Installing dendropy
#We have "pip" set up on our server, making the python package easy to install:
sudo pip install -U dendropy

Running SumTrees
#place all of your bootstrap outputs and your best tree from GARLI in its own folder

cd /Volumes/Yango/Trpc2/garli/

mkdir sumtrees
cp Trpc2.GTR4DG.best.tre sumtrees/
cp Trpc2.GTR4DG.boot.boot.tre sumtrees/
cp Trpc2.GTR4DG.boot2.boot.tre sumtrees/
cp Trpc2.GTR4DG.boot3.boot.tre sumtrees/
cp Trpc2.GTR4DG.boot4.boot.tre sumtrees/

#from within the folder, run sumtrees.py
#your target tree is the "best" tree from your GARLI output

sumtrees.py --decimals=0 --percentages --no-analysis-metainformation --no-taxa-block Trpc2.GTR4DG.boot.boot.tre Trpc2.GTR4DG.boot2.boot.tre Trpc2.GTR4DG.boot3.boot.tre Trpc2.GTR4DG.boot4.boot.tre --target=Trpc2.GTR4DG.best.tre --output=supportOnBest.phy

Viewing tree with bootstraps
Now, open supportOnBest.phy in FigTree. Click on "node labels" and select "support" as the node label. These are your bootstrap values.

Monday, August 31, 2015

I am going to make a gene tree of all the Trpc2 sequences to demonstrate how different the pseudogenes are from the conserved gene--the hypothesis being that the pseudogenes will have much longer branch lengths.

Running modelo
modelomatic_OSX Trpc2_modelomatic.phy bionj Trpc2_modelomatic_normal.txt 0 normal
---------------------------------------------------------------------------------------------
  ModelOMatic (v1.01 (release)).
A program for choosing substitutions models for phylogenetic inference.
Written by Simon Whelan.
Contributions from James Allen, Ben Blackburne and David Talavera.
---------------------------------------------------------------------------------------------
Data: <Trpc2_modelomatic.phy>: 118 sequences of length 489 (DataMatrix: 118 x 380)
Checking for sparse (>85% gaps) sequences
Starting with 118 ... after removal there are 118 sequences
Creating start tree ... 
Assertion failed: (0), function Error, file tools.cxx, line 235.
Unrecognised codon: TAG for data codon position 153 in sequence 52...Abort trap: 6
105-238:modelomatic loloyohe$ modelomatic_OSX Trpc2_modelomatic.phy bionj Trpc2_modelomatic_normal.txt 0 normal

---------------------------------------------------------------------------------------------
  ModelOMatic (v1.01 (release)).
A program for choosing substitutions models for phylogenetic inference.
Written by Simon Whelan.
Contributions from James Allen, Ben Blackburne and David Talavera.
---------------------------------------------------------------------------------------------
Data: <Trpc2_modelomatic.phy>: 118 sequences of length 489 (DataMatrix: 118 x 380)
Checking for sparse (>85% gaps) sequences
Starting with 118 ... after removal there are 118 sequences
Creating start tree ...  estimated using bionj (0.813652s)
Optimisation settings: normal 
Scanning for modelomatic.ini file ...
Amino acid models skipping rtREV, HIVb, HIVw
Codon models skipping F0, F1X4, F3X4, F64 done
Working with genetic code: Universal
>>> Doing model analysis <<< 
RY Done  (2.84812s)
NT Done  (65.4292s)
AA Done  (267.533s)...
Codon Done  (0.255969s)

Outputting results to <Trpc2_modelomatic_normal.txt>

Successful exit

For once, something worked. The model with the best fit is the K2P+4dG. This is the Kimura two-parameter + gamma with four site rates. A good summary of the nucleotide models can be found here. K2P means that the transitions (alpha) and transversions (beta) will have different rates, but that all nucleotides occur at the same frequency. This model can easily be implemented in GARLI. The K2P model is just a simpler HKY model--HKY allows for different transition and transversion rates, but also allows 4 parameters for nucleotide substitutions. I am using the GARLI on CIPRES. 

#everything is default except
Maximum hours to run=48
-searchreps=8
-bootstrapreps=0
-datatype=nucleotide
#to make it K2P make sure to change base statefrequencies
-ratematrix=HKY(2rate)
-statefrequencies=fixed
-invariant sites=none
#this is where the 4dG comes in from modelomatic
-ratehetmodel=gamma
-numratecats=4



ERROR: state frequencies specified as fixed, but no

        Garli block found in Trpc2_garli.phy!!

Oops, it choked. -statefrequencies should be set to equal:

#everything is default except
Maximum hours to run=48
-searchreps=8
-bootstrapreps=0
-datatype=nucleotide
#to make it K2P make sure to change base statefrequencies
-ratematrix=HKY(2rate)
-statefrequencies=equal
-invariant sites=none
#this is where the 4dG comes in from modelomatic
-ratehetmodel=gamma

-numratecats=4
From the best tree from GARLI:

Now onto testing if some branch lengths are significantly longer...

Saturday, July 18, 2015

After running GARLI, I wanted to compare
OR137:
OR213: lots of disagreements; redo with 100
OR4: lots of disagreements; redo with 100
OR589: lot
OR6: all agree after 8 runs
OR10: all agree after 8 runs
OR11: tree2 disagrees, 7 others agree, use this as main gene tree
OR51: lots of disagreements; redo with 100
OR52: lots of disagreements; redo with 100

100 is a lot of tree to compare, so I wrote a small script doing so:

setwd("/Volumes/Yango/Hayden_batORs/garli/OR51/trees")

#tell where to open up all of your downloaded files from CIPRES
my.files<-list.files("/Volumes/Yango/Hayden_batORs/garli/OR51/trees/")
#read in all of these trees to a list using read.nexus function
my.trees<-lapply(my.files, read.nexus)

#create a matrix to store all of the tree comparisons
tree.matrix<-matrix(nrow = length(my.trees), ncol = length(my.trees))

for(i in 1:length(my.trees)){ #for every tree in the file
  this.tree<-as.phylo(my.trees[[i]]) #look at that tree
  #compare the topology to all the other trees
  tree.matrix[i,]<-sapply(my.trees, function(my.trees) dist.topo(this.tree,my.trees))
}
tree.matrix

#print out the ones that agree with each other
for(i in 1:length(tree.matrix[1,])){
  this.list<-as.numeric(which(tree.matrix[i,]==0))
  print(this.list)
}
write.csv(tree.matrix, "tree.matrix.csv")

For OR51, are things looking better?
[1] 1
[1] 2
[1] 3
[1]  4 84
[1] 5
[1] 6
[1] 7
[1]  8 15 41 79
[1]  9 14
[1] 10 54
[1] 11
[1] 12
[1] 13 73
[1]  9 14
[1]  8 15 41 79
[1] 16
[1] 17 20 75
[1] 18
[1] 19
[1] 17 20 75
[1] 21 51 68
[1] 22 37
[1] 23 50 71
[1] 24
[1] 25
[1] 26
[1] 27
[1] 28 69 88
[1] 29
[1] 30
[1] 31
[1] 32 46
[1] 33
[1] 34 70
[1] 35 53
[1] 36
[1] 22 37
[1] 38 59
[1] 39 47 78
[1] 40
[1]  8 15 41 79
[1] 42
[1] 43
[1] 44
[1] 45
[1] 32 46
[1] 39 47 78
[1] 48
[1] 49
[1] 23 50 71
[1] 21 51 68
[1] 52
[1] 35 53
[1] 10 54
[1] 55
[1] 56
[1] 57
[1] 58
[1] 38 59
[1] 60
[1] 61
[1] 62 86 87
[1] 63
[1] 64
[1] 65
[1] 66 95
[1] 67
[1] 21 51 68
[1] 28 69 88
[1] 34 70
[1] 23 50 71
[1] 72
[1] 13 73
[1] 74
[1] 17 20 75
[1] 76
[1] 77
[1] 39 47 78
[1]  8 15 41 79
[1] 80
[1] 81
[1] 82
[1] 83
[1]  4 84
[1] 85
[1] 62 86 87
[1] 62 86 87
[1] 28 69 88
[1] 89
[1] 90
[1] 91
[1] 92
[1] 93
[1] 94
[1] 66 95
[1] 96
[1] 97
[1] 98

Womp, womp...not really--only a maximum of 4 trees out of 100 are agreeing.

OR4s looking much better:
[1]  1 22 28 42 75
[1]  2 99
[1]  3 14 18 83 89
 [1]  4 12 17 20 23 24 27 51 58 59 72 77 79 88 96
[1] 5
[1] 6
[1]  7 15 98
[1] 8
[1]  9 19 38 80
[1] 10 50
[1] 11 67
 [1]  4 12 17 20 23 24 27 51 58 59 72 77 79 88 96
[1] 13 86
[1]  3 14 18 83 89
[1]  7 15 98
[1] 16
 [1]  4 12 17 20 23 24 27 51 58 59 72 77 79 88 96
[1]  3 14 18 83 89
[1]  9 19 38 80
 [1]  4 12 17 20 23 24 27 51 58 59 72 77 79 88 96
[1] 21 31 44 62 84
[1]  1 22 28 42 75
 [1]  4 12 17 20 23 24 27 51 58 59 72 77 79 88 96
 [1]  4 12 17 20 23 24 27 51 58 59 72 77 79 88 96
[1] 25
[1] 26
 [1]  4 12 17 20 23 24 27 51 58 59 72 77 79 88 96
[1]  1 22 28 42 75
[1] 29
[1] 30 33
[1] 21 31 44 62 84
[1] 32 37 61 68 81
[1] 30 33
[1]  34 100
[1] 35
[1] 36
[1] 32 37 61 68 81
[1]  9 19 38 80
[1] 39 71
[1] 40 49
[1] 41
[1]  1 22 28 42 75
[1] 43
[1] 21 31 44 62 84
[1] 45
[1] 46
[1] 47
[1] 48
[1] 40 49
[1] 10 50
 [1]  4 12 17 20 23 24 27 51 58 59 72 77 79 88 96
[1] 52
[1] 53
[1] 54
[1] 55
[1] 56 90
[1] 57
 [1]  4 12 17 20 23 24 27 51 58 59 72 77 79 88 96
 [1]  4 12 17 20 23 24 27 51 58 59 72 77 79 88 96
[1] 60
[1] 32 37 61 68 81
[1] 21 31 44 62 84
[1] 63
[1] 64
[1] 65
[1] 66
[1] 11 67
[1] 32 37 61 68 81
[1] 69
[1] 70
[1] 39 71
 [1]  4 12 17 20 23 24 27 51 58 59 72 77 79 88 96
[1] 73
[1] 74
[1]  1 22 28 42 75
[1] 76
 [1]  4 12 17 20 23 24 27 51 58 59 72 77 79 88 96
[1] 78
 [1]  4 12 17 20 23 24 27 51 58 59 72 77 79 88 96
[1]  9 19 38 80
[1] 32 37 61 68 81
[1] 82
[1]  3 14 18 83 89
[1] 21 31 44 62 84
[1] 85
[1] 13 86
[1] 87
 [1]  4 12 17 20 23 24 27 51 58 59 72 77 79 88 96
[1]  3 14 18 83 89
[1] 56 90
[1] 91
[1] 92
[1] 93
[1] 94
[1] 95
 [1]  4 12 17 20 23 24 27 51 58 59 72 77 79 88 96
[1] 97
[1]  7 15 98
[1]  2 99
[1]  34 100

Wednesday, June 24, 2015

Running GARLI from our Model-O-Matic code.

If we remember from our model-o-matic output and sort by the lowest AIC, we can see the best model for OR6 is CodonM0.0.EQU+4dG. It is interesting how all codon models are preferred to the nucleotide models! This turned out to be the case in every set of alignments.




Things are looking good. I'll keep you posted on the results.

Here is the following best-fit models for each of the OR gene families (see summary files with AIC and likelihoods in /Volumes/Yango/Hayden_batORs/model_o_matic/excel_summaries):

OR137      CodonM0.0.EQU+4dG
OR213      CodonM0.0F64+4dG
OR4          CodonM0.0.F64+4dG
OR589      CodonM0.0.F64+4dG
OR6          CodonM0.0.EQU + 4dG
OR10        CodonM0.0.F64+4dG
OR11        CodonM0.0F64+4dG
OR51        CodonM0.0F64+4dG
OR52        CodonM0.0F64+4dG

Those highlighted in pink were run in model-O-matic as the -100normal parameter, as the program would crash if it was run as -normal. My thoughts are there are too many taxa, but other than than that I am at a loss.

Now we can apply this model to GARLI using our tree output. There are two options for running GARLI (at least that I use). One is on the CIPRES server, the other can be from command line. I am currently having CPU usage problems on CIPRES so I am going to run it from command.

Installing GARLI is easy if you have homebrew.
105-238:~ loloyohe$ brew install garli
Should be ready to go.

Let's take a second to interpret the output from

105-238:OR6 loloyohe$ garli garli.conf
Running GARLI Version 2.01.1067 (18 May 2012)
->Single processor version for 64-bit OS<-
##############################################################
 This is GARLI 2.0, the first "official" release including 
          partitioned models.  It is a merging of
   official release 1.0 and beta version GARLI-PART 0.97
  Briefly, it includes models for nucleotides, amino acids,
 codons, and morphology-like characters, any of which can be 
  mixed together and applied to different subsets of data.

    General program usage is extensively documented here:
            http://www.nescent.org/wg/garli/
      see this page for details on partitioned usage:
  http://www.nescent.org/wg/garli/Partition_testing_version
   and this page for details on Mkv mophology model usage:
    http://www.nescent.org/wg/garli/Mkv_morphology_model
         PLEASE LET ME KNOW OF ANY PROBLEMS AT:
                garli.support@gmail.com
##############################################################
Compiled Apr 27 2015 16:06:23 using GNU gcc compiler version 4.9.2
Using NCL version 2.1.17

#######################################################
Reading config file garli.conf
###################################################
READING OF DATA
Attempting to read data file in Nexus format (using NCL):
Phyllos_OR6_Funct_tAlign_noStops.nex ...
Reading DATA block...storing implied block: TAXA
storing read block: DATA
 successful

###################################################
PARTITIONING OF DATA AND MODELS
GARLI data subset 1
CHARACTERS block #1 ("Untitled DATA Block 1")
Data read as Nucleotide data,
modeled as Codon data
Gaps or ambiguity codes found within codon for taxon Lsi_29_OR6.
Codons coded as missing for that taxon: 6 120 235 236 
Gaps or ambiguity codes found within codon for taxon Phe_43_OR6.
Codons coded as missing for that taxon: 6 
Gaps or ambiguity codes found within codon for taxon Sti_17_OR6.
Codons coded as missing for that taxon: 185 
Summary of data:
  35 sequences.
  6 constant characters.
  232 parsimony-informative characters.
  5 uninformative variable characters.
  243 total characters.
  243 unique patterns in compressed data matrix.
Pattern processing required < 1 second


###################################################
NOTE: Unlike many programs, the amount of system memory that Garli will
use can be controlled by the user.
(This comes from the availablememory setting in the configuration file.
Availablememory should NOT be set to more than the actual amount of 
physical memory that your computer has installed)

For this dataset:
 Mem level availablememory setting
  great     >= 150 MB
  good approx 149 MB to 99 MB
  low approx 98 MB to 42 MB
  very low approx 42 MB to 31 MB
the minimum required availablememory is 31 MB

You specified that Garli should use at most 512.0 MB of memory.

Garli will actually use approx. 224.4 MB of memory
**Your memory level is: great (you don't need to change anything)**

#######################################################
Found outgroup specification:  1

#######################################################
STARTING RUN

>>>Search rep 1 (of 2)<<<
MODEL REPORT - Parameters are at their INITIAL values (not yet optimized)
Model 1
  Number of states = 61 (codon data, standard code)
  Nucleotide Relative Rate Matrix Assumed by Codon Model:     2 rates (transition and transversion) K param = 4.0000
  Equilibrium State Frequencies: equal (1/61 = 0.01639, fixed)
  Rate Heterogeneity Model:
    4 nonsynonymous rate categories, rate and proportion of each estimated
     (this is effectively the M3 model of PAML)
      dN/dS Proportion
      0.2500 0.2500
      0.5000 0.2500
      0.7500 0.2500
      1.0000 0.2500

Starting with seed=287687

creating likelihood stepwise addition starting tree...
number of taxa added:
 4  5  6  7  8  9  10  11  12  13  14  15  16  17

We can set the exact same thing up in CIPRES. 

Monday, April 27, 2015

installing garli on yosemite

sudo gem update --system
brew install open-mpi --cc=gcc-4.9 --build-from-source
brew remove ncl garli
brew install ncl garli --cc=gcc-4.9 --build-from-source