Laurel's Lab Notebook: sequence repetition

Sunday, September 16, 2012

Most of the delay in the intron repeat sorting program came from trying to set up a new IDE on a new machine. After installing a GCC compiler, it was a nightmare trying to get a program to build in Eclipse. Days were spent setting up environment variables. Finally, all hope was abandoned and I tried Code::Blocks. The initial RepeatScout program (at least the part I am going to be working on) built and ran on the first try. SIGH. Finally I could get the ball rolling.

Yesterday and today were by far the most productive. The first problem I wanted to tackle was the user input. Instead of confusing parameters to try to follow, the program prompts the user to enter the .fas file location in a very friendly manner. The program now reads in FASTA sequences and locates repetitive elements via the l-mer algorithm mentioned below. My intention is to score the introns based on their degree of repetitiveness. The score will be based on the number of times an l-mer repeats, and the sequences will then be sorted by their scores. The output will be a sorted .fas file.
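A minimal sketch of how such a score could be computed, under stated assumptions: it counts exact, overlapping l-mer matches only (case-sensitive, no reverse complements, no mismatches), which is much simpler than what RepeatScout itself does. The function name `lmer_repeat_score` and the definition of the score (every l-mer occurrence beyond the first of its kind adds one) are illustrative choices, not the program's actual code.

```c
#include <stdlib.h>
#include <string.h>

/* Length of the l-mers being compared. It lives at file scope because
   qsort's comparator takes no extra context argument. */
static size_t g_lmer_len;

/* Compare two l-mers given pointers to their start positions. */
static int cmp_lmer(const void *a, const void *b)
{
    const char *const *pa = a;
    const char *const *pb = b;
    return strncmp(*pa, *pb, g_lmer_len);
}

/* Hypothetical score: the number of l-mer occurrences in seq that repeat
   an l-mer already seen elsewhere in seq. Returns -1 if memory runs out. */
long lmer_repeat_score(const char *seq, size_t l)
{
    size_t n = strlen(seq);
    if (l == 0 || n < l)
        return 0;

    size_t count = n - l + 1;              /* number of l-mer start positions */
    const char **starts = malloc(count * sizeof *starts);
    if (!starts)
        return -1;
    for (size_t i = 0; i < count; i++)
        starts[i] = seq + i;

    /* Sorting puts identical l-mers next to each other. */
    g_lmer_len = l;
    qsort(starts, count, sizeof *starts, cmp_lmer);

    long score = 0;
    for (size_t i = 1; i < count; i++)
        if (strncmp(starts[i - 1], starts[i], l) == 0)
            score++;

    free(starts);
    return score;
}
```

Sorting the start positions costs O(n log n) comparisons per sequence, which should be fine for intron-sized input; a hash table of l-mers would be the usual alternative for genome-scale data.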

To do:

Determine whether the program is reading the file as individual sequences or as one giant genomic sequence. It should ideally be the former, but I need to double check. First priority.
Write user prompt to enter the name of the output file where the sorted fasta sequences will be placed.
Put in counter to detect number of times a repetitive element is found in a sequence. This will be the score.
Sort list with the score.
Write output to .fas
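The to-do items above could be sketched roughly as follows. This is an assumption-laden outline, not the program's real code: the `FastaRecord` struct and the functions `read_fasta`, `sort_by_score`, `write_fasta` and `free_fasta` are hypothetical names. It keeps each `>` record as its own sequence (rather than one giant sequence), leaves the score to be filled in by whatever counter is chosen, and sorts from most to least repetitive.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    char *header;  /* text after '>' */
    char *seq;     /* sequence with line breaks removed */
    long  score;   /* repetitiveness score, filled in by the caller */
} FastaRecord;

/* Append n bytes of src to the NUL-terminated buffer *dst of length *len. */
static int append(char **dst, size_t *len, const char *src, size_t n)
{
    char *p = realloc(*dst, *len + n + 1);
    if (!p) return -1;
    memcpy(p + *len, src, n);
    *len += n;
    p[*len] = '\0';
    *dst = p;
    return 0;
}

void free_fasta(FastaRecord *recs, long n)
{
    for (long i = 0; i < n; i++) { free(recs[i].header); free(recs[i].seq); }
    free(recs);
}

/* Read every record from fp into *out; returns the record count or -1. */
long read_fasta(FILE *fp, FastaRecord **out)
{
    FastaRecord *recs = NULL;
    long n = 0;
    size_t hdr_len = 0, seq_len = 0;
    int line_start = 1, in_header = 0;
    char chunk[4096];

    while (fgets(chunk, sizeof chunk, fp)) {
        size_t len = strcspn(chunk, "\r\n");
        int ends = chunk[len] != '\0';  /* chunk reached the end of a line */
        int rc = 0;
        if (line_start && chunk[0] == '>') {       /* a new record begins */
            FastaRecord *p = realloc(recs, (size_t)(n + 1) * sizeof *recs);
            if (!p) goto fail;
            recs = p;
            recs[n].header = recs[n].seq = NULL;
            recs[n].score = 0;
            n++;
            hdr_len = seq_len = 0;
            in_header = 1;
            rc = append(&recs[n - 1].header, &hdr_len, chunk + 1, len - 1)
              || append(&recs[n - 1].seq, &seq_len, "", 0);
        } else if (n > 0 && in_header) {           /* rest of a long header */
            rc = append(&recs[n - 1].header, &hdr_len, chunk, len);
        } else if (n > 0) {                        /* sequence data */
            rc = append(&recs[n - 1].seq, &seq_len, chunk, len);
        }
        if (rc) goto fail;
        if (ends) in_header = 0;
        line_start = ends;
    }
    *out = recs;
    return n;
fail:
    free_fasta(recs, n);
    return -1;
}

/* Order records from highest score to lowest. */
static int cmp_score_desc(const void *a, const void *b)
{
    const FastaRecord *ra = a, *rb = b;
    return (ra->score < rb->score) - (ra->score > rb->score);
}

void sort_by_score(FastaRecord *recs, long n)
{
    qsort(recs, (size_t)n, sizeof *recs, cmp_score_desc);
}

void write_fasta(FILE *fp, const FastaRecord *recs, long n)
{
    for (long i = 0; i < n; i++)
        fprintf(fp, ">%s\n%s\n", recs[i].header, recs[i].seq);
}
```

Note that `qsort` is not guaranteed to be stable, so records with equal scores may come out in any order, and this sketch writes each sequence on a single line rather than wrapping it.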

If worked on full-time, it could probably be finished in two days.

Also need to type up minutes from Skype meeting with Sushma and Liliana from Thursday. Tomorrow....

Wednesday, September 5, 2012

So RepeatScout is only going to serve as inspiration for a future mini-program. It doesn't do what I want it to, and nothing else does either, so I am going to use the same line of thinking and make a simpler program. Also, I hate the command prompt of RepeatScout and every other program I am trying, so that is the first thing that's going to change.

Now to clarify my line of thinking...

OBJECTIVE OF PROGRAM: To read a fasta sequence and find repetitive elements and note a degree of repetitiveness (# of repetitive elements? # of l-mers repeated? both? TBD). It will then sort the list of original fasta sequences by the degree of repetitiveness. Output will be a list of sorted fasta sequences (and maybe something else giving more details, we'll see).

Right now, RepeatScout is mostly written in C. I like C, so I will probably continue in it. Remember to add the RepeatScout paper to my Mendeley so I can look up the exact details of the algorithm.

Goal for the day: Write the file opener, reader, command prompt. Try to begin to incorporate l-mer finder.
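A friendlier prompt could be as small as the sketch below. The helper name `prompt_line` is hypothetical, and taking the input and output streams as parameters is just one possible design (it makes the function easy to try without a keyboard).

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: print msg to out, then read one line from in into
   buf with the line ending stripped. Returns 0 on success, or -1 on end of
   input, a read error, or an empty answer. */
int prompt_line(FILE *in, FILE *out, const char *msg, char *buf, size_t size)
{
    if (size == 0)
        return -1;
    fputs(msg, out);
    fflush(out);                      /* make sure the prompt shows up first */
    if (!fgets(buf, (int)size, in))
        return -1;
    buf[strcspn(buf, "\r\n")] = '\0'; /* drop the trailing newline, if any */
    return buf[0] == '\0' ? -1 : 0;
}
```

It would be called as something like `prompt_line(stdin, stdout, "Path to the .fas file: ", path, sizeof path)`, with the result passed to `fopen(path, "r")`. An answer longer than the buffer is cut off, so the buffer should be sized generously.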

Thursday, August 30, 2012

By limiting primer lengths to only 15-20bp in Geneious, we narrowed our list of introns from 900 down to the 134 that we were able to generate primers for. Two base pairs make a huge difference!

With primer lengths limited to 15-21bp, primers were found for 348 of the 900 target sequences. Over 3,000 primers were found in total.

Decided it made more sense to sift through the introns first to see what is repetitive before actually making the primers. I would imagine there are many different ways to go about this. The task is to find repeated subsequences within a larger given sequence, which reminds me of the n-mer algorithm. Looked through the software already available. Saha, et al. 2008 provided a good overview of the software and a review of what I am actually trying to do. Settled on trying RepeatScout, as I could easily install it on the lab Mac and the code looks manageable to manipulate.

After finally figuring out the parameters, it looks promising. Will pick up where I left off tomorrow.