Wednesday, January 21, 2015

Downloading new transcriptome data from BGI!

#Enter the ftp address in terminal window
ftp 128.120.88.244
Connected to 128.120.88.244.

220 (vsFTPd 2.2.2)

#input given username and password
#copy from ftp into your current directory
get DR_086_MO_E_1.fq.gz
get DR_086_MO_E_2.fq.gz

#or we can get multiple files
mget DR_086_VN*.fq.gz

#or we can get all the files!
mget *.fq.gz
exit

Now we can unzip the data, concatenate the two files for each species, and run the quality control filter

Wednesday, December 3, 2014


Gave in and installed the new blast+ using homebrew.

The sequence can have no newlines and needs to have a FASTA specific header:
head /Volumes/Ruficollis/ORA_annotations/C_sowelli_ORs_nonewline.fasta
>gb|deg7180000002010|OR7
CTGCACTCACCTATG....
>gb|deg7180000002011|OR51_PSEUDOGENE

sudo makeblastdb -in /Volumes/Ruficollis/ORA_annotations/C_sowelli_ORs_nonewline.fasta -out C_sowelli_ORS_blastdb -dbtype nucl -parse_seqids


awk '{if (substr($0,1,1)==">"){print "\n"$0} else printf("%s",$0);p++;}END{print "\n"}' DR_091_Mobl_MOE_assembled.fasta > joined.fasta

renamed_DR_013_Mored_MOE fasta:
>gnl|some_ID|gene129

Still didn't work.
OKAY GRUMBLE!!!! BLAST AND ITS ARCHAIC WAY OF LIFE >_<

perl ~/ora-1.9.1/scripts/or.pl -sequence DR_004_Arjam_MOE_assembled.part-01.fasta -a -d 

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: die in _initialize, hmm profile not found at /Volumes/Ruficollis/ORA_annotations/to_analyze/or.hmm

It doesn't like not being in its home directory when running the perl script.
perl or.pl -sequence /Volumes/Ruficollis/ORA_annotations/to_analyze/DR_004_Arjam_MOE_assembled.part-01.fasta -a -d >/Volumes/Ruficollis/ORA_annotations/to_analyze/to_cat/DR_004_ORs_pt1.fasta

Okay so this finally worked. Now time to get hacky. With all of my sequence data I get this annoying perl file open error:


Too many open files at /Library/Perl/5.16/Bio/ORA.pm line 524.

ulimit -a
The OS limits the number of files that perl can open at one time.
ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
file size               (blocks, -f) unlimited
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 256
pipe size            (512 bytes, -p) 1
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 709
virtual memory          (kbytes, -v) unlimited

We see that the max is 256. From what I've read on a Mac OSX the max is 1024--but PLEASE someone correct me if I'm wrong.

ulimit -n 1024

Just to be safe, I am going to split my sequence data into 100 fasta files each:
perl ~/Scripts/fasta-splitter.pl DR_004_Arjam_MOE_assembled.fasta -n-parts 100
perl ~/Scripts/fasta-splitter.pl DR_011_Arjam_VNO_assembled.fasta -n-parts 100
perl ~/Scripts/fasta-splitter.pl DR_013_Mored_MOE_assembled.fasta -n-parts 100
perl ~/Scripts/fasta-splitter.pl DR_013_Mored_VNO_assembled.fasta -n-parts 100
perl ~/Scripts/fasta-splitter.pl DR_091_Mobl_MOE_assembled.fasta -n-parts 100

Then run the ORA pipeline in a shell script for each file. AND THEN cat them all together. Yes, I know..very hacky.

#this must be run from within the ORA scripts folder so it can access all of the HMM folders at once

sh run_ORA_DR_004MOE.sh

#What does this shell script look like?
#!/bin/bash
perl or.pl -sequence /Volumes/Ruficollis/ORA_annotations/to_analyze/DR_004_Arjam_MOE_assembled.part-001.fasta -a -d >/Volumes/Ruficollis/ORA_annotations/to_analyze/to_cat/DR_004_ORs_pt1.fasta
perl or.pl -sequence /Volumes/Ruficollis/ORA_annotations/to_analyze/DR_004_Arjam_MOE_assembled.part-002.fasta -a -d >/Volumes/Ruficollis/ORA_annotations/to_analyze/to_cat/DR_004_ORs_pt2.fasta
perl or.pl -sequence /Volumes/Ruficollis/ORA_annotations/to_analyze/DR_004_Arjam_MOE_assembled.part-003.fasta -a -d >/Volumes/Ruficollis/ORA_annotations/to_analyze/to_cat/DR_004_ORs_pt3.fasta
....
perl or.pl -sequence /Volumes/Ruficollis/ORA_annotations/to_analyze/DR_004_Arjam_MOE_assembled.part-100.fasta -a -d >/Volumes/Ruficollis/ORA_annotations/to_analyze/to_cat/DR_004_ORs_pt100.fasta

cat /Volumes/Ruficollis/ORA_annotations/to_analyze/to_cat/DR_004_ORs* >> /Volumes/Ruficollis/ORA_annotations/DR_004_Arjam_MOE_ORs.fasta

Okay if that doesn't do the trick then I will be letting out another very loud grumble!

Wednesday, November 19, 2014

Problem troubleshooted while installing Phylobayes on Mavericks by LM Davalos:

Problem: Parallel phylobayes (pb_mpi1.5a) does not compile from source on multi-processor mac running OS 10.9. After clearing out any errors having to do with open-mpi, these errors persist:

file BP2util.h
shell: fatal error: 'tr1/unordered_map' file not found
#include <tr1/unordered_map>

original #include <tr1/unordered_map> #tr1 is deprecated in 10.9 and will not compile
modified: #include <unordered_map>

shell: error: use of undeclared identifier 'tr1'
original: class BipartitionHashTable : public tr1::unordered_map<string,T> {};
modified: class BipartitionHashTable : public unordered_map<string,T> {};

file BP2Stat.h
shell: fatal error: 'tr1/unordered_map' file not found
#include <tr1/unordered_map>

original: #include <tr1/unordered_map>
modified: #include <unordered_map>
#define hash_map std::unordered_map

file PolyNode.cpp
shell: error: variable length array of non-POD element type 'string'
original:  string retval[degree]; #line 431
modified:  string *retval = new string[degree]; #line 431

Wednesday, September 24, 2014

perl ~/Scripts/blast_parser.pl <DR_004_Arjam_MOE_assembled_v_OR_VR_indices.tblastx >DR_004_Arjam_MOE_assembled_v_OR_VR_indices.tblastx.parsed2
OK, processed 8019 queries, 23298 hits

perl ~/Scripts/blast_parser.pl <DR_011_Arjam_VNO_assembled_v_OR_VR_indices.tblastx  >DR_011_Arjam_VNO_assembled_v_OR_VR_indices.tblastx.parsed2
OK, processed 8019 queries, 77307 hits

perl ~/Scripts/blast_parser.pl <DR_013_Mored_MOE_assembled_v_OR_VR_indices.tblastx  >DR_013_Arjam_MOE_assembled_v_OR_VR_indices.tblastx.parsed2
OK, processed 8019 queries, 251312 hits

perl ~/Scripts/blast_parser.pl <DR_013_Mored_VNO_assembled_v_OR_VR_indices.tblast  >DR_013_Arjam_VNO_assembled_v_OR_VR_indices.tblastx.parsed2
OK, processed 8019 queries, 232694 hits

perl ~/Scripts/blast_parser.pl <DR_091_Mobl_MOE_assembled_v_OR_VR_indices.tblastx  >DR_091_Mobl_MOE_assembled_v_OR_VR_indices.tblastx.parsed2
OK, processed 8019 queries, 96314 hits

Tuesday, September 16, 2014

#remove newlines
awk '{if (substr($0,1,1)==">"){print "\n"$0} else printf("%s",$0);p++;}END{print "\n"}' DR_091_Mobl_MOE_assembled.fasta > joined.fasta

If I try running BLASTALL with my transcriptome data as the database, I get this error:

[blastall 2.2.22] ERROR: SeqPortNew: lcl

All the names of the sequences are too similar so I have to rename them--I am just going to rename them by number.
awk '/^>/{print ">" ++i; next}{print}' < DR_091_Mobl_MOE_assembled.fasta > renamed_DR_091_Mobl_MOE_assembled.fasta

#export the path the blast
export PATH=$PATH:/Users/laurelyohe/blast-2.2.22/bin/

#format the database 
#because the transcriptome is bigger than the indices, it will serve as the database
formatdb -i renamed_DR_091_Mobl_MOE_assembled.fasta -o T -p F
#run blast
blastall -p tblastx -d renamed_DR_013_Mored_MOE_assembled.fasta -i OR_VR_indices.fasta -e 1e-06 -v 5 -b 5 -a 2 -o ./DR_013_Mored_MOE_assembled_v_OR_VR_indices.tblastx

#now we need to parse the BLAST script
perl ~/Scripts/parse_bls.pl --i DR_091_Mobl_MOE_assembled_v_OR_VR_indices.tblastx > DR_091_Mobl_MOE_assembled_v_OR_VR_indices.tblastx.parsed
#next combine the parsed file with the sequences
~/Scripts/extract_hsps_new.pl -i renamed_DR_091_Mobl_MOE_assembled.fasta -t DR_091_Mobl_MOE_assembled_v_OR_VR_indices.tblastx.parsed -s DR_091_Mormoops_blainvelli
Okay, I have finally accepted that my data is not fitting an Ornstein-Uhlenbeck model of evolution and it makes zero sense to be running the tests that I have been running. My alpha value is way too small for all of my traits compared to the height of my tree.

I have run the fitContinuous() function in the "geiger" package on the geometric mean.

Thursday, September 4, 2014