Thursday, May 29, 2014

Working in Dr. Stephen Rossiter's lab to learn to assemble my transcriptome data.

[1] Collect all relevant sequences to help identify reads as genes
>in Ensembl:
-in BioMart, select database "Microbat"
-export the gene IDs for olfactory receptors (set this in "Filters"); this will make an Excel sheet of IDs that you will use to sort the attributes
-under "Attributes" -> "Sequences" -> check "Ensembl Gene ID", "Associated Gene Name", "Ensembl Transcript ID", and "coding sequences"
-under "Features" -> check "Ensembl Gene ID", "Ensembl Transcript ID"

>in GenBank:
-look in the paper for accession numbers that link to a page where the sequences can be exported in FASTA format; if there is no link, search by accession range with "#:#[pacc]"

[2] Run quality control

There are three main steps in pre-processing quality control:
--remove reads with adapters
--remove reads in which unknown nucleotides (N) make up more than 5% of the bases
--remove reads with low quality (more than 20% of the bases in the read have a quality score below 10)
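
As a rough sketch of the second and third filters (not the pipeline's actual code), the two thresholds can be written as simple per-read checks. This assumes Phred+33 quality encoding, which I have not confirmed for this data set; the function names are made up for illustration, and adapter removal is not covered here.

```python
def fraction_unknown(seq):
    """Fraction of bases in the read that are unknown (N)."""
    if not seq:
        return 1.0  # treat an empty read as entirely unknown
    return seq.upper().count("N") / len(seq)


def fraction_low_quality(qual, min_q=10, offset=33):
    """Fraction of bases whose Phred score is below min_q.

    offset=33 assumes Phred+33 encoding (an assumption, check your files).
    """
    if not qual:
        return 1.0
    low = sum(1 for ch in qual if ord(ch) - offset < min_q)
    return low / len(qual)


def passes_qc(seq, qual, max_n=0.05, max_low=0.20, min_q=10, offset=33):
    """True if the read survives both filters.

    A read is removed when more than max_n of its bases are N, or when
    more than max_low of its bases have quality below min_q.
    """
    return (fraction_unknown(seq) <= max_n
            and fraction_low_quality(qual, min_q, offset) <= max_low)
```

For example, `passes_qc("ACGT" * 5, "I" * 20)` keeps the read, while a 20-base read with two N's (10% unknown) is rejected.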

We sequenced the olfactory bulb transcriptome using paired-end long-read HiSeq through the GE SeqWright pipeline, and their quality control only preprocesses the data through the first step, removing the adapter sequences. Because we still need to filter out bad reads, this involves some creativity.

I first tried to filter out bad reads based on their quality score, but because the data are paired-end, an uneven number of reads was removed from each file (~6,000 from one; ~13,000 from the other), and it would be difficult to re-pair the mates once the order is offset.
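
One way around the offset problem, sketched here under assumptions rather than as something I actually ran, is to read both files in lockstep and drop a pair whenever either mate fails. This assumes plain four-line FASTQ records, the same read order in both files, and Phred+33 encoding; `read_fastq`, `mean_quality`, and `filter_pairs` are hypothetical helper names, and the `keep` test is whatever quality rule you choose.

```python
from itertools import islice


def read_fastq(handle):
    """Yield (header, seq, qual) from a 4-line-per-record FASTQ handle."""
    while True:
        lines = list(islice(handle, 4))
        if len(lines) < 4:
            return  # end of file (or a truncated final record)
        header, seq, _plus, qual = (line.rstrip("\n") for line in lines)
        yield header, seq, qual


def mean_quality(qual, offset=33):
    """Mean Phred score of a quality string (Phred+33 assumed)."""
    if not qual:
        return 0.0
    return sum(ord(ch) - offset for ch in qual) / len(qual)


def filter_pairs(fwd_handle, rev_handle, out_fwd, out_rev, keep):
    """Write only pairs where BOTH mates pass keep(record).

    Because a failing mate removes its partner too, the two output
    files stay the same length and in the same order.
    Returns (pairs_kept, pairs_dropped).
    """
    kept = dropped = 0
    for r1, r2 in zip(read_fastq(fwd_handle), read_fastq(rev_handle)):
        if keep(r1) and keep(r2):
            for out, (header, seq, qual) in ((out_fwd, r1), (out_rev, r2)):
                out.write(f"{header}\n{seq}\n+\n{qual}\n")
            kept += 1
        else:
            dropped += 1
    return kept, dropped
```

With real data the handles would be the two open FASTQ files, and `keep` could be something like `lambda rec: mean_quality(rec[2]) >= 20`. The cost is that a good read is thrown away whenever its mate is bad.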

I then tried to overlap all of the reads using FLASH, with the intention of then filtering bad reads from the ones that overlap. FLASH worked really well, but only 59% of my pairs overlapped enough, leaving me with only 16 million reads (as opposed to 27 million). Back to square one.

I then installed a program called "Popoolation". No comment.
I followed these instructions for installation. I had to tweak the 

--------------------
This has been sitting in "Drafts" since January. I'm going to publish it anyway.
I am revisiting my babbler project.

Looking at the Tracer files again, I see that the "babblers_8_newdate" task provided the best convergence. I installed SumTrees to find the consensus tree.

#install Setuptools
wget https://bootstrap.pypa.io/ez_setup.py -O - | sudo python
#enter password

#install SumTrees

sudo easy_install -U dendropy

Okay, I realized I had already made the consensus tree back in October!

See directory /Users/loloyohe/Documents/Timaliidae/Babblers Adaptive Radiation/trees/output_trees