Wednesday, July 2, 2014

Here is the summary of the PacBio reads that I am working with:

Reads: 38,703
GPC1 reads: 18,182
GPC2 reads: 20,521
Mean read length: 627

Read Quality: 0.98 (~2% error)

I am going to rerun the assembly with the error rate set to 2%.
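As a quick sanity check on the summary, here is a minimal Python sketch that confirms the two pools add up to the total read count and converts the read quality into an error rate. The variable names are my own for illustration; the numbers are copied from the summary.

```python
# Numbers copied from the read summary in this post.
total_reads = 38_703
reads_per_pool = {"GPC1": 18_182, "GPC2": 20_521}
read_quality = 0.98

# The two pools should account for every read.
assert sum(reads_per_pool.values()) == total_reads

# The error rate is the complement of the read quality.
# Rounding hides floating-point noise from the subtraction.
error_rate = round(1 - read_quality, 4)
print(f"error rate: {error_rate:.2%}")
```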

Because most of the reads are long and ORs are only around 900 bp, Geneious could essentially align the reads from the F and R primers, and there is a good chance that this gives us assembled OR sequences. (It does: my adviser and I did it yesterday.)

In a way this makes more sense, since the goal of most assembly programs is to assemble a genome into a scaffold. We, however, have subgenome sequences that are scattered throughout the genome. Basically, we are assembling many genes at once, and they don't need to be put into one big scaffold.

However, I am still more comfortable with using Celera for two reasons:
1) Sequences that are too repetitive are removed. Celera separates the sequences into "unitigs": the unique unitigs are used as seeds for the scaffold, while the repetitive or non-unique unitigs are set aside and later overlapped onto the seed scaffold where possible. Because we don't really care about the scaffold, the Celera output that we do care about is the degenerate unitigs: unique contigs (built from more than one read) that do not fit into a scaffold, which suggests that they are unique genes. After this, I plan to filter the degenerate unitigs by length. Then, things should be ready for the ORA pipeline.

2) You can easily control the error-correction parameters.
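The length filter mentioned in reason 1 could look something like the sketch below. This is only an illustration, not the actual pipeline: it assumes the degenerate unitigs are available as FASTA text, and the record names (`deg_1` and so on), the function names, and the `min_len`/`max_len` cutoffs are all hypothetical. For real data the cutoffs would presumably be chosen around the expected OR length of roughly 900 bp.

```python
from typing import Iterable, Iterator, List, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(
    records: Iterable[Tuple[str, str]], min_len: int, max_len: int
) -> List[Tuple[str, str]]:
    """Keep only records whose sequence length is within [min_len, max_len]."""
    return [(h, s) for h, s in records if min_len <= len(s) <= max_len]


# Toy input with made-up record names and very short sequences.
example = """>deg_1
ACGTACGTAC
>deg_2
ACG
>deg_3
ACGTACGTACGTACGTACGT
""".splitlines()

# Hypothetical cutoffs chosen to suit the toy sequences above.
kept = filter_by_length(parse_fasta(example), min_len=5, max_len=15)
print([header for header, _ in kept])
```

With the toy input, only the 10-base record falls inside the 5 to 15 base window; the 3-base and 20-base records are dropped.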
