Sunday, September 16, 2012

Most of the delay on the intron repeat sorting program came from trying to set up a new IDE on a new machine. After installing GCC, it was a nightmare trying to get a program to build in Eclipse. Days were spent setting up environment variables. Finally, all hope was abandoned and I tried Code::Blocks. The initial RepeatScout program (at least the part I am going to be working on) built and ran on the first try. SIGH. Finally could get the ball rolling.

Yesterday and today were by far the most productive. The initial problem I wanted to tackle was the user input. Instead of confusing parameters to try to follow, the program prompts the user for the .fas file location in a very friendly manner. The program now reads FASTA sequences and locates repetitive elements via the l-mer algorithm mentioned below. My intention is to score the introns based on their degree of repetitiveness. The score will be based on the number of times an l-mer repeats, and the sequences will then be sorted by score. The output will be a sorted .fas file.
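To make the scoring idea concrete, here is a minimal sketch of one possible score: the number of l-mer occurrences in a sequence that repeat an l-mer already seen earlier in that same sequence. This is an assumption about what "number of times an l-mer repeats" could mean, not RepeatScout's algorithm, and the name repeat_score is made up for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical score: the number of l-mer occurrences in seq that repeat
 * an l-mer already seen at an earlier position in the same sequence.
 * Naive O(n^2 * l) scan: fine for short introns, too slow for genomes.
 * Comparison is case-sensitive, so soft-masked (lowercase) bases will
 * not match their uppercase counterparts. */
size_t repeat_score(const char *seq, size_t l)
{
    size_t n = strlen(seq);
    size_t score = 0;

    if (l == 0 || n < l)
        return 0;

    for (size_t i = 1; i + l <= n; i++) {
        for (size_t j = 0; j < i; j++) {
            if (memcmp(seq + i, seq + j, l) == 0) {
                score++;   /* count position i once, then move on */
                break;
            }
        }
    }
    return score;
}
```

With l = 4, the sequence ACGTACGT scores 1, because the only repeated 4-mer occurrence is the second ACGT. A hash table of l-mers would be the obvious replacement if the quadratic scan turns out to be too slow.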

To do:
  • Determine whether the program is reading the file as individual sequences or as one giant genomic sequence. It should ideally be the former, but I need to double check. First priority.
  • Write a user prompt for the name of the output file where the sorted FASTA sequences will be written.
  • Put in a counter for the number of times a repetitive element is found in a sequence. This will be the score.
  • Sort the list by score.
  • Write the output to .fas.
If worked on full-time, it could probably be finished in two days. 
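The sorting and output items on the list could look roughly like the sketch below. The fasta_rec type, its fields, and the function names are all hypothetical, and the sketch assumes each sequence has already been read into memory and scored. Highest score first is also an assumption; flip the comparator for the opposite order.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record type: one FASTA entry plus its repeat score. */
typedef struct {
    const char *header;   /* header line without the leading '>' */
    const char *seq;      /* sequence residues, no line breaks */
    size_t score;
} fasta_rec;

/* qsort comparator: highest score first. */
int by_score_desc(const void *a, const void *b)
{
    const fasta_rec *ra = a, *rb = b;
    return (ra->score < rb->score) - (ra->score > rb->score);
}

void sort_by_score(fasta_rec *recs, size_t n)
{
    qsort(recs, n, sizeof recs[0], by_score_desc);
}

/* Write records as FASTA, wrapping sequence lines at 60 characters.
 * Returns 0 on success, -1 on a write error. */
int write_fasta(FILE *out, const fasta_rec *recs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        size_t len = strlen(recs[i].seq);

        if (fprintf(out, ">%s\n", recs[i].header) < 0)
            return -1;
        for (size_t off = 0; off < len; off += 60) {
            size_t chunk = len - off < 60 ? len - off : 60;
            if (fprintf(out, "%.*s\n", (int)chunk, recs[i].seq + off) < 0)
                return -1;
        }
    }
    return 0;
}
```

Note that qsort makes no promise about the order of records with equal scores, so ties may come out in a different order than they appeared in the input file.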

Also need to type up minutes from Skype meeting with Sushma and Liliana from Thursday. Tomorrow....

Wednesday, September 5, 2012

So RepeatScout is only going to serve as inspiration for a future mini-program. It doesn't do what I want, and nothing else does either, so I am going to follow the same line of thinking and make a simpler program. Also, I hate the command prompt of RepeatScout and every other program I am trying, so that is the first thing that's going to change.

Now to clarify my line of thinking...

OBJECTIVE OF PROGRAM: To read a fasta sequence and find repetitive elements and note a degree of repetitiveness (# of repetitive elements? # of l-mers repeated? both? TBD). It will then sort the list of original fasta sequences by the degree of repetitiveness. Output will be a list of sorted fasta sequences (and maybe something else giving more details, we'll see).

Right now, RepeatScout is mostly written in C. I like C, so I will probably stick with it. Remember to add this paper to my Mendeley so I have the exact details of the RepeatScout algorithm.

Goal for the day: Write the file opener, reader, command prompt. Try to begin to incorporate l-mer finder.
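A rough sketch of the prompt and reader pieces, using only the C standard library. The function names prompt_path and count_fasta_records are made up for illustration. The record counter is just a sanity check that a file holds several '>' headers rather than one giant sequence; it does not parse the sequences themselves.

```c
#include <stdio.h>
#include <string.h>

/* Print msg on out, read one line from in into buf, strip the newline.
 * Returns buf, or NULL on end of input or if nothing was entered. */
char *prompt_path(FILE *in, FILE *out, const char *msg,
                  char *buf, size_t size)
{
    fputs(msg, out);
    fflush(out);
    if (size == 0 || fgets(buf, (int)size, in) == NULL)
        return NULL;
    buf[strcspn(buf, "\r\n")] = '\0';
    return buf[0] ? buf : NULL;
}

/* Count FASTA records by counting lines that begin with '>'. */
long count_fasta_records(FILE *fp)
{
    long count = 0;
    int c;
    int at_line_start = 1;

    while ((c = getc(fp)) != EOF) {
        if (at_line_start && c == '>')
            count++;
        at_line_start = (c == '\n');
    }
    return count;
}
```

A caller would do something like prompt_path(stdin, stdout, "Path to the .fas file: ", path, sizeof path), then fopen the result and check for NULL before reading.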