Monday, July 20, 2015

*Most of this text is from the MBL tutorial for MSA. I am recording what I find and important points, in case the site gets lost over time.

Alignments with nucleotides: MUSCLE V. mafft
mafft --retree 2 1ped.fasta > mafft_dna.fasta

muscle -verbose -log muscle_dna.log -in 1ped.fasta -out muscle_dna.fasta

Compare the alignments resulting from A steps and B steps. Are they different? How many columns are in each the MAFFT or the MUSCLE alignment? What may be wrong with both? (Hint: these are protein coding genes).
They were not well-aligned, there were indels that were not multiples of three and they are protein coding genes!
Do the topologies and/or branch lengths differ? (Hint: look up some species names to get an idea of the expected topology!) Yes--very much so.


Bottom line: Don't do nucleotide alignments on protein-based sequences!

Comparing what happens with gaps:

muscle -verbose -log muscle_gap-20.log -in 1ped_aa.fasta -out muscle_aa_gap-20.fasta -gapopen -20
muscle -verbose -log muscle_gap-1.log -in 1ped_aa.fasta -out muscle_aa_gap-1.fasta -gapopen -1

Compare the modified gap penalty alignments to the default one. How do they differ? Which has the most gaps? Can you guess the order of magnitude of the default gap penalty? Can you find in the log files of MUSCLE the gap penalty use?
The higher gap penalty makes for a much more compressed alignment and helps to take care of this over "indel" issue. The default for MUSCLE is -2.9.


Alignment extension

   papara -t example.tre -s aln.phy -q sequence1.fasta -n run1


   papara -t example.tre -s aln.phy -q sequence2.fasta -n run2


How does the alignment look? Any concerns? How does the requirement for testing for sequence homology before site alignment apply here?
There are big tails at the ends of the new sequence. The alignment may have been trimmed previously, or the previous group used a different set of primers. The middle part is probably homologous, but a lot of times the ends may not line up perfectly. Always check your alignment.

   pagan --ref-seqfile aln.fasta -t example.tre -q sequence1.fasta -o pagan1

  pagan --ref-seqfile aln.fasta -t example.tre -q sequence2.fasta -o pagan2

Pagan requires input alignments in fasta rather than phylip format, but we have provided that in aln.fasta Try typing only 'pagan' to see the full explanation of parameters Compare the pagan alignments for the two sequences you downloaded.
What happens when you try to align the second sequence? HINT: Look at the output to screen. Why?
Adding the second sequence fails. Pagan is helping us to prevent alignments of completely random sequence! Thanks Pagan!

Both papara and pagan can align short reads as well as full length sequences, but caution about homology is particularly appropriate in these cases.

Codon Alignments

This is not part of the exercises, it's just for your future information.
As you now know, it is not appropriate to align nucleotides of protein coding regions. In the exercises above, you translated the nucleotides to amino acids which you could use to infer trees. But sometimes you want to analyze nucleotides that have been aligned by codon. So how do you go from an amino acid alignment back to codon-aligned nucleotides? We suggest the Pal2Nal server. You will upload your protein alignment and nucleotide sequences, and it will spit out the codon alignment. Please be aware that your nucleotides must be multiples of three (i.e. a full open reading frame).

No comments:

Post a Comment