Laurel's Lab Notebook: celera

Showing posts with label celera. Show all posts

Wednesday, July 2, 2014

New rerun of .spec file

# original asm settings
utgErrorRate = 0.02
utgErrorLimit = 2 #default is 4.5

#defaults from the manual
cnsErrorRate = 0.02
cgwErrorRate = 0.10 #this is for the scaffold so just leaving
ovlErrorRate = 0.02

merSize=14

merylMemory = 128000
merylThreads = 16

ovlStoreMemory = 8192

# grid info
useGrid = 0
scriptOnGrid = 0
frgCorrOnGrid = 0
ovlCorrOnGrid = 0

sge = -A assembly
sgeScript = -pe threads 16
sgeConsensus = -pe threads 1
sgeOverlap = -pe threads 2
sgeFragmentCorrection = -pe threads 2
sgeOverlapCorrection = -pe threads 1

#ovlMemory=8GB --hashload 0.7
ovlHashBits = 25
ovlThreads = 2
ovlHashBlockLength = 20000000
ovlRefBlockSize = 50000000

# for mer overlapper
merCompression = 1
merOverlapperSeedBatchSize = 500000
merOverlapperExtendBatchSize = 250000

frgCorrThreads = 2
frgCorrBatchSize = 100000

ovlCorrBatchSize = 100000

# non-Grid settings, if you set useGrid to 0 above these will be used
merylMemory = 128000
merylThreads = 4

ovlStoreMemory = 8192

ovlConcurrency = 8

cnsConcurrency = 8

merOverlapperThreads = 3
merOverlapperSeedConcurrency = 3
merOverlapperExtendConcurrency = 3

frgCorrConcurrency = 2
ovlCorrConcurrency = 4
cnsConcurrency = 4

Here is the summary of the PacBio reads that I am working with:

Reads: 38,703
GPC1 reads: 18,182
GPC2 reads: 20,521
Mean read length: 627

Read Quality: 0.98 (~2% error)

I am going to rerun the assembly with 2% error.

Because most of the reads are so long and ORs are technically around 900bp, Geneious could essentially align the sequences from F and R primers and there is a good chance that we will get assembled OR sequences (which we can because my adviser and I did it yesterday).

In a way this makes more sense since the goal of most Assembly programs are to assemble a genome, making a scaffold. However, we have subgenome sequences that are scattered throughout the genome. Basically, it is assembly many genes at once but doesn't need to put them in one big scaffold string.

However, I am still more comfortable with using Celera for two reasons:
1) Sequences that are too repetitive are removed. Celera is an algorithm separates the gene sequences into either unique "unitigs" to be set as seed scaffold. The repetitive or non-unique "unitigs" are saved for later to try to overlap on the seed scaffold. Because we don't really care about the scaffold, the output from Celera that we do care about is actually the degenerate unitigs, which are unique contigs (more than one read) that do not fit into a scaffold, highlighting that they are unique genes. After this, I plan to filter the degenerate unitigs by length. Then, things should be ready for the ORA pipeline.

2) You can easily control the error-correction parameters.

Monday, June 30, 2014

Getting data assembled now. Not using Sprai--going to try to use Celera now (which I have already installed).

After several hours of downloading a million things, it turns out there a bug in svn code, but not if you download the .tar.gz file directly from sourceforge. For some reason, kmer will not compile (it must not be updated) if you try to svn into it.

Anyways, here is what I got to work:
Visit: http://sourceforge.net/projects/wgs-assembler/files/wgs-assembler/wgs-8.1/
bzip2 -dc wgs-8.1.tar.bz2 | tar -xf - #don't compile Celera yet
cd kmer
gmake install
#oops don't have gmake; just make it the same as "make"
sudo ln -s /usr/bin/make /usr/bin/gmake
cd ..
cd samtools
make
cd ..
cd src
gmake

#i tried getting Figaro and UMDOverlapper to work but I don't want to mess
#things up; let's try this for now

In the README, it says you can run the assembler with:
wgs-8.1/*/bin/runCA
#in my case, the * is Darwin-i386

The sequences are now kept in the spare Drive
cd Volumes/Spare/pacbio/C_sowelli

gunzip filtered_subreads.fast*

gunzip reads_of_insert.fast*

#make the FRG wrap file to be inputted

~/wgs-8.1/Darwin-i386/bin/fastqToCA -libraryname GPC -technology pacbio-raw -reads reads_of_insert.fastq >GPC_untrimmed.frg

#make the .spec file--for this first run:

#saved as GPC_spec.spec

merSize = 17

merThreshold = 0

merDistinct = 0.9995

merTotal = 0.995

doOBT = 0

doExtendClearRanges = 0

unitigger = bogart

ovlErrorRate = 0.05 # Compute overlaps up to 5% error

utgGraphErrorRate = 0.05 # Unitigs at 5% error

utgMergeErrorRate = 0.05 # Unitigs at 5% error

cnsErrorRate = 0.05 # Needed to allow ovlErrorRate=0.05

cgwErrorRate = 0.05 # Needed to allow ovlErrorRate=0.05

ovlConcurrency = 16

cnsConcurrency = 16

ovlThreads = 1

ovlHashBits = 22

ovlHashBlockLength = 10000000

ovlRefBlockSize = 25000

#cnsReduceUnitigs = 0 0 # Always use only uncontained reads for consensus

cnsReuseUnitigs = 1 # With no mates, no need to redo consensus

cnsMinFrags = 1000

cnsPartitions = 256

#run the assembler

~/wgs-8.1/Darwin-i386/bin/runCA -d Volumes/Spare/pacbio/C_sowelli/Assembly/GPC-trim -p GPC-trim -s /Volumes/Spare/pacbio/C_sowelli/Assembly/GPC_spec.spec GPC_untrimmed.frg

Trying with 5% error rates since the sequences are so similar.

Friday, June 27, 2014

We have the Pacbio reads back...first step is to assemble.

I am using Sprai because Sprai will output longer contigs--essential in distinguishing different olfactory receptors.

Sprai requires:
-Python > v 2.6 (which thank goodness we have on the server)
-NCBI BLAST v.2.2.27 (which we also have!)
-Celera Assembler (which we don't have)

Installing Celera Assembler:
Dowload the file (v. 8.1--could not get v 8.2 working) and navigate to home folder:
bzip2 -dc wgs-8.1.tar.bz2 | tar -xf -
cd wgs-8.1
cd kmer && make install && cd ..
cd samtools && make && cd ..
cd src && make && cd ..
cd ..

For Sprai to work, I changed the source code to accept longer reads.
cd wgs-8.1/src
vi AS_global.h
Change:

#define AS_READ_MAX_NORMAL_LEN_BITS 11

to:

#define AS_READ_MAX_NORMAL_LEN_BITS 15

Woops--I guess the new version is already set to 16!

Now install sprai:

tar -xzf sprai-0.9.5.1.3.tar.gz

./waf configure

./waf buile

And of course I get an error.

Error #1:

/Users/loloyohe/sprai-0.9.5.1.3/col2fqcell.h:78:7: error: use of undeclared identifier 'number_of_ballots'

number_of_ballots += ballot[i];

....and this continues for anywhere "number_of_ballots" is stated.

Error #1 Solution:

"myrealigner.c" inherits the header "col2fqcell.h"

You can see in myrealigner.c there is no declaration of "number_of_objects"

In myrealigner.c, paste

int number_of_ballots = 0;

under

int maximum_ballots = 11;

Error #2 & #3:

/Users/loloyohe/sprai-0.9.5.1.3/col2fqcell.h:25:47: error: function definition is not allowed here

void set_vals(int col_index, int coded_base){

../myrealigner.c:583:102: error: function definition is not allowed here

void print_fastq(char *chr, char *seq, char *depth, char *qual, char *base_exists, char *comment){

Error #2 & #3 solution:

Facepalming the person who wrote this code. In C++, you cannot declare functions inside of functions GAHHHHHH

In "col2fqcell.h", near line 26, move:

void set_vals(int col_index, int coded_base){

++ballot[coded_base];

max_qvs[coded_base] = (max_qvs[coded_base] < (col[col_index].qv-'!')) ? (col[col_index].qv-'!') : max_qvs[coded_base];

}

outside of the function

void col2fqcell(){

...

}

Okay now the amount of errors occurring is worrisome. There are so many undeclared variables and functions. This version of the code should not have been published. I am writing the the group that has published this and will follow up.