
Friday, October 16, 2015

Installing pbh5tools on OS X Yosemite

I need to assemble PacBio amplicons of olfactory receptors. One of the main challenges I have been having is that most of the PacBio tools (SMRT Analysis, Celera Assembler) focus on assembling whole genomes. Following the approach of Larson et al. (2014) in BMC Genomics, I am going to quality-filter with pbh5tools, keeping only reads with a minimum number of passes (e.g., 4), though the mean number of passes in my data is quite high (~20).
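As a sketch of that filtering step: pbh5tools ships a bash5tools.py script, and the call should look something like the following (the input file name is made up, and the flag names are from the copy I installed, so double-check them against bash5tools.py --help):

bash5tools.py m131210_gpc_s1_p0.bas.h5 --outFilePrefix GPC_filtered --outType fastq --readType ccs --minPasses 4  #keep CCS reads built from >= 4 passes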

To install pbh5tools, I dealt with a series of challenges on Yosemite:
Download from github.
cd pbh5tools-master/
sudo python setup.py install

I ran into this pesky error.
In file included from /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/core/include/numpy/ndarraytypes.h:1760:
/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/core/include/numpy/npy_1_7_deprecated_api.h:15:2: warning: 
      "Using deprecated NumPy API, disable it by "          "#defining
      NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-W#warnings]
#warning "Using deprecated NumPy API, disable it by " \
 ^
In file included from /tmp/easy_install-_W_eCV/h5py-2.5.0/h5py/defs.c:279:
/tmp/easy_install-_W_eCV/h5py-2.5.0/h5py/api_compat.h:27:10: fatal error: 
      'hdf5.h' file not found
#include "hdf5.h"
         ^
1 warning and 1 error generated.

error: Setup script exited with error: command 'cc' failed with exit status 1
First I tried installing the HDF5 packages and then exporting the path to them, but that wasn't working.
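For the record, what I was attempting looked roughly like this (Homebrew paths assumed; h5py's build reads HDF5_DIR to find hdf5.h):

brew install hdf5
export HDF5_DIR=/usr/local  #where Homebrew links the hdf5 headers on Yosemite
sudo pip install h5py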

From the setup.py, I see that it needs:
    install_requires=[
        'pbcore >= 0.8.0',
        'numpy >= 1.6.0',
        'h5py >= 1.3.0'
        ]

sudo pip install pbcore
sudo pip install numpy
sudo pip install h5py

Then,
sudo python setup.py install
...
Using /Library/Python/2.7/site-packages
Searching for pysam==0.8.3
Best match: pysam 0.8.3
pysam 0.8.3 is already the active version in easy-install.pth

Using /Library/Python/2.7/site-packages
Finished processing dependencies for pbh5tools==0.8.0
Voila!
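As a quick sanity check that the install worked, both of the scripts that ship with pbh5tools should now respond on the command line:

bash5tools.py --help
cmph5tools.py --help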

Friday, July 11, 2014

Visit: http://ec2-54-88-69-219.compute-1.amazonaws.com:8080/smrtportal/

Login.

To create a new job:
Click "Create New"
Click "de Novo Assembly" and "Data Prep" then click "Next"

The data for C. sowelli is already loaded as a SMRT cell. Start a new job by naming the job, selecting the protocol, and choose "C. sowelli_amp" and move it from SMRT Cells available to "SMRT Cells in Job". Click "save" and "start"at the bottom right-hand corner.

To view a failed job, click on it; on the right-hand side, click "View Log".

I have tried RS_HGAP Assembly 2 (for quality), RS_HGAP Assembly 3 (for speed), and a Long Amplicon assembly; all three have failed, and it is not clear why, because each seems to start the filtering step without problems.

Wednesday, July 2, 2014

Rerun with a new .spec file



# original asm settings
utgErrorRate = 0.02
utgErrorLimit = 2 #default is 4.5

#defaults from the manual
cnsErrorRate = 0.02
cgwErrorRate = 0.10 #this is for the scaffold so just leaving
ovlErrorRate = 0.02

merSize=14

merylMemory = 128000
merylThreads = 16

ovlStoreMemory = 8192

# grid info
useGrid = 0
scriptOnGrid = 0
frgCorrOnGrid = 0
ovlCorrOnGrid = 0

sge = -A assembly
sgeScript = -pe threads 16
sgeConsensus = -pe threads 1
sgeOverlap = -pe threads 2
sgeFragmentCorrection = -pe threads 2
sgeOverlapCorrection = -pe threads 1

#ovlMemory=8GB --hashload 0.7
ovlHashBits = 25
ovlThreads = 2
ovlHashBlockLength = 20000000
ovlRefBlockSize = 50000000

# for mer overlapper
merCompression = 1
merOverlapperSeedBatchSize = 500000
merOverlapperExtendBatchSize = 250000

frgCorrThreads = 2
frgCorrBatchSize = 100000

ovlCorrBatchSize = 100000

# non-Grid settings, if you set useGrid to 0 above these will be used
merylMemory = 128000
merylThreads = 4

ovlStoreMemory = 8192

ovlConcurrency = 8

cnsConcurrency = 8

merOverlapperThreads = 3
merOverlapperSeedConcurrency = 3
merOverlapperExtendConcurrency = 3

frgCorrConcurrency = 2
ovlCorrConcurrency = 4
cnsConcurrency = 4

Here is the summary of the PacBio reads that I am working with:

Reads: 38,703
GPC1 reads: 18,182
GPC2 reads: 20,521
Mean read length: 627

Read Quality: 0.98 (~2% error)

I am going to rerun the assembly with 2% error.
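The rerun itself is the same runCA call as the first attempt (see the June 30 post below); the run directory and prefix here are placeholders I am using for this 2% pass:

~/wgs-8.1/Darwin-i386/bin/runCA -d /Volumes/Spare/pacbio/C_sowelli/Assembly/GPC-2pct -p GPC-2pct -s /Volumes/Spare/pacbio/C_sowelli/Assembly/GPC_spec2.spec GPC_untrimmed.frg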

Because most of the reads are this long and ORs are only around 900 bp, Geneious can essentially align the sequences from the F and R primers directly, and there is a good chance of getting assembled OR sequences that way (we know we can, because my adviser and I did it yesterday).

In a way this makes more sense, since the goal of most assembly programs is to assemble a genome by building scaffolds. We, however, have amplicon sequences scattered throughout the genome: the task is to assemble many genes at once, without stringing them into one big scaffold.

However, I am still more comfortable with using Celera for two reasons:
1) Sequences that are too repetitive are removed. Celera's algorithm separates the sequences into unique "unitigs" that seed scaffolds, while repetitive or non-unique unitigs are saved for later and overlapped onto those seed scaffolds. Because we don't really care about the scaffold, the Celera output we actually want is the degenerate unitigs: unique contigs (built from more than one read) that do not fit into a scaffold, which flags them as likely unique genes. After assembly, I plan to filter the degenerate unitigs by length (see the awk sketch after this list). Then things should be ready for the ORA pipeline.

2) You can easily control the error-correction parameters.
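For the length filter on the degenerate unitigs mentioned in 1), something like the awk below should do; the 9-terminator path is assumed from Celera's output layout, and the 600 bp cutoff is a placeholder:

awk -v min=600 '
  /^>/ { if (seq != "" && length(seq) >= min) print hdr "\n" seq; hdr = $0; seq = ""; next }
       { seq = seq $0 }
  END  { if (seq != "" && length(seq) >= min) print hdr "\n" seq }
' GPC-trim/9-terminator/GPC-trim.deg.fasta > deg_unitigs_min600.fasta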

Tuesday, July 1, 2014

Things to do today:
[1] Fill out the CDC form for Peru COMPLETE
[2] Fix the .spec file for assembly COMPLETE
I am trying this now (.spec file):

# original asm settings
utgErrorRate = 0.05
utgErrorLimit = 4.5

#defaults from the manual
cnsErrorRate = 0.06
cgwErrorRate = 0.10
ovlErrorRate = 0.06

merSize=14

merylMemory = 128000
merylThreads = 16

ovlStoreMemory = 8192

# grid info
useGrid = 0
scriptOnGrid = 0
frgCorrOnGrid = 0
ovlCorrOnGrid = 0

sge = -A assembly
sgeScript = -pe threads 16
sgeConsensus = -pe threads 1
sgeOverlap = -pe threads 2
sgeFragmentCorrection = -pe threads 2
sgeOverlapCorrection = -pe threads 1

#ovlMemory=8GB --hashload 0.7
ovlHashBits = 25
ovlThreads = 2
ovlHashBlockLength = 20000000
ovlRefBlockSize = 50000000

# for mer overlapper
merCompression = 1
merOverlapperSeedBatchSize = 500000
merOverlapperExtendBatchSize = 250000

frgCorrThreads = 2
frgCorrBatchSize = 100000

ovlCorrBatchSize = 100000

# non-Grid settings, if you set useGrid to 0 above these will be used
merylMemory = 128000
merylThreads = 4

ovlStoreMemory = 8192

ovlConcurrency = 8

cnsConcurrency = 8

merOverlapperThreads = 3
merOverlapperSeedConcurrency = 3
merOverlapperExtendConcurrency = 3

frgCorrConcurrency = 2
ovlCorrConcurrency = 4
cnsConcurrency = 4

I took out "stopAfter=overlapper" because it halted the run after the overlap stage; I am not sure why we would want to do that here.
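For reference, stopAfter names a pipeline stage at which runCA halts; since rerunning runCA on the same -d directory resumes from where it stopped, the option is meant for deliberately staged runs:

stopAfter = overlapper  #halt once overlaps are computed; rerun runCA to continue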

[3] Write syllabus for high school student COMPLETE [7/2/2014]
[4] Look at Trpc2 data (big Mac broken :[)
[5] Enter grades into blackboard COMPLETE
[6] Submit conservation paper COMPLETE [7/2/2014]
--------------------------------------------------

Monday, June 30, 2014

Getting the data assembled now. Not using Sprai; going to try Celera directly instead (which I have already installed).

After several hours of downloading a million things, it turns out there is a bug in the svn checkout that is not in the .tar.gz you download directly from SourceForge: for some reason, kmer will not compile (it must be out of date) if you svn into it.

Anyways, here is what I got to work:
Visit: http://sourceforge.net/projects/wgs-assembler/files/wgs-assembler/wgs-8.1/
bzip2 -dc wgs-8.1.tar.bz2 | tar -xf - #don't compile Celera yet
cd wgs-8.1/kmer
gmake install
#oops don't have gmake; just make it the same as "make"
sudo ln -s /usr/bin/make /usr/bin/gmake
cd ..
cd samtools
make
cd ..
cd src
gmake

#i tried getting Figaro and UMDOverlapper to work but I don't want to mess
#things up; let's try this for now

In the README, it says you can run the assembler with:
  wgs-8.1/*/bin/runCA
#in my case, the * is Darwin-i386

The sequences are now kept on the Spare drive:
cd /Volumes/Spare/pacbio/C_sowelli

gunzip filtered_subreads.fast*
gunzip reads_of_insert.fast*

#make the FRG wrapper file used as input
~/wgs-8.1/Darwin-i386/bin/fastqToCA -libraryname GPC -technology pacbio-raw -reads reads_of_insert.fastq >GPC_untrimmed.frg
#make the .spec file--for this first run:
#saved as GPC_spec.spec
merSize               = 17
merThreshold          = 0
merDistinct           = 0.9995
merTotal              = 0.995

doOBT                 = 0
doExtendClearRanges   = 0

unitigger             = bogart

ovlErrorRate          = 0.05  #  Compute overlaps up to 5% error
utgGraphErrorRate     = 0.05  #  Unitigs at 5% error
utgMergeErrorRate     = 0.05  #  Unitigs at 5% error
cnsErrorRate          = 0.05  #  Needed to allow ovlErrorRate=0.05
cgwErrorRate          = 0.05  #  Needed to allow ovlErrorRate=0.05

ovlConcurrency        = 16
cnsConcurrency        = 16

ovlThreads            = 1
ovlHashBits           = 22
ovlHashBlockLength    = 10000000
ovlRefBlockSize       = 25000

#cnsReduceUnitigs      = 0 0   #  Always use only uncontained reads for consensus
cnsReuseUnitigs       = 1     #  With no mates, no need to redo consensus

cnsMinFrags           = 1000
cnsPartitions         = 256

#run the assembler
~/wgs-8.1/Darwin-i386/bin/runCA -d /Volumes/Spare/pacbio/C_sowelli/Assembly/GPC-trim -p GPC-trim -s /Volumes/Spare/pacbio/C_sowelli/Assembly/GPC_spec.spec GPC_untrimmed.frg

Trying with 5% error rates since the sequences are so similar.




Friday, June 27, 2014

We have the PacBio reads back...first step is to assemble.

I am using Sprai because Sprai will output longer contigs--essential for distinguishing different olfactory receptors.

Sprai requires:
-Python > v 2.6 (which thank goodness we have on the server)
-NCBI BLAST v.2.2.27 (which we also have!)
-Celera Assembler (which we don't have)
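A quick check of the first two requirements on the command line (assuming the BLAST+ binaries, which report their version with -version):

python --version  #need > 2.6
blastn -version   #need 2.2.27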

Installing Celera Assembler:
Download the file (v. 8.1--could not get v. 8.2 working) and navigate to the home folder:
   bzip2 -dc wgs-8.1.tar.bz2 | tar -xf -
  cd wgs-8.1
  cd kmer && make install && cd ..
  cd samtools && make && cd ..
  cd src && make && cd ..
  cd ..

For Sprai to work, I changed the source code to accept longer reads.
cd wgs-8.1/src
vi AS_global.h
Change:
#define AS_READ_MAX_NORMAL_LEN_BITS 11
to:
#define AS_READ_MAX_NORMAL_LEN_BITS 15
(The read-length cap is 2^AS_READ_MAX_NORMAL_LEN_BITS - 1, so 11 bits caps reads at ~2 kbp while 15 allows ~32 kbp.) Whoops--it turns out this version is already set to 16!

Now install Sprai:

tar -xzf sprai-0.9.5.1.3.tar.gz
cd sprai-0.9.5.1.3
./waf configure
./waf build

And of course I get an error.
Error #1: 
/Users/loloyohe/sprai-0.9.5.1.3/col2fqcell.h:78:7: error: use of undeclared identifier 'number_of_ballots'
      number_of_ballots += ballot[i];
      ^
...and this continues everywhere "number_of_ballots" appears.

Error #1 solution:
"myrealigner.c" includes the header "col2fqcell.h", and nowhere in myrealigner.c is "number_of_ballots" declared. In myrealigner.c, paste
int number_of_ballots = 0;
under
int maximum_ballots = 11;

Error #2 & #3:
/Users/loloyohe/sprai-0.9.5.1.3/col2fqcell.h:25:47: error: function definition is not allowed here
  void set_vals(int col_index, int coded_base){
                                              ^
../myrealigner.c:583:102: error: function definition is not allowed here
    void print_fastq(char *chr, char *seq, char *depth, char *qual, char *base_exists, char *comment){
     
Error #2 & #3 solution: 
Facepalming the person who wrote this code. In C++, you cannot declare functions inside of functions GAHHHHHH

In "col2fqcell.h", near line 26, move:
  void set_vals(int col_index, int coded_base){
    ++ballot[coded_base];
    max_qvs[coded_base] = (max_qvs[coded_base] < (col[col_index].qv-'!')) ? (col[col_index].qv-'!') : max_qvs[coded_base];
  }
outside of the function
void col2fqcell(){
...
  }
(Note that set_vals references ballot, max_qvs, and col, so those have to stay visible to it after the move, either passed in as parameters or declared at file scope.)

Okay, now the number of errors occurring is worrisome. There are so many undeclared variables and functions that this version of the code should not have been published. I am writing to the group that published it and will follow up.