Monday, July 20, 2015

FASTA tutorial

Identifying homologs and non-homologs; effects of scoring matrices and algorithms; using domain annotations

1. Use the FASTA search page [pgm] to compare Honey bee glutathione transferase D1 NP_001171499/ H9KLY5_APIME [seq] (gi|295842263) to the PIR1 Annotated protein sequence database. Be sure to press, not .
  1. Take a look at the output.
    1. How long is the query sequence?
      1. 217 amino acids
    2. How many sequences are in the PIR1 database?
      1. 13,143 sequences
    3. What scoring matrix was used?
      1. BL50 matrix
    4. What were the gap penalties? (what is the penalty for a one-residue gap? two residues?)
      1. One-residue gap: -12; two-residue gap: -14 (each additional residue costs -2)
    5. What are each of the numbers after the description of the library sequence? Which one is best for inferring homology?
      1. sp|P20432|GSTT1_DROME Glutathione S-transferase 1-1; DD ( 209)
      2. 209 is the length of the library sequence, in amino acids
    6. How similar is the highest scoring sequence? What is the difference between %_id and %_sim? Why is there no 100% identity match?
      1. 81% similar, but 57% identical
      2. The sequence is not in the database
    7. Looking at an alignment, where are the boundaries of the alignment (the best local region)? How many gaps are in the best alignment? The second best?
      1. It is a local alignment: the boundaries matter because extending the alignment with more residues would lower the score.
      2. There are no gaps in the first, there are 14 in the second
  2. What is the highest scoring non-homolog? (The non-homolog with the highest alignment score, or the lowest E()-value.) How would you confirm that your candidate non-homolog was truly unrelated? (Hint - compare your candidate non-homolog with SwissProt for a more comprehensive test.)
    1. The highest scoring unrelated sequence should have an E() value of 1 (or around 1). You can do a "General re-search" with this sequence and make sure it does not find your original query.
  3. Homology (and non-homology) can also be inferred by domain relationships. Try the same Honey bee GSTD1 search [pgm] using the Annotate Query and Database Annotations: set to show .
    1. Does the domain display change your mind about the highest scoring non-homolog?
      1. Results with an E() of 0.11 show completely different domains, while some hits with an even higher E() share a domain with the query. As a rule of thumb, NODOM regions with Q-values < 30 are often said to have no domain.
    2. There are three parts to the domain display, the domain structure of the query (top) sequence (if available), the domain structure of the library (bottom) sequence, and the domain alignment boundaries in the middle (inside the alignment box). The boundaries and color of the alignment domain coloring match the Region: sub-alignment scores.
      1. Based on the score, the second part of the domain is not significant. Sometimes a homolog is too distant (in evolutionary terms) to be given a high score.
      2. Domains have a characteristic length (about 100 residues). Check whether there is room for a domain; if it is not contributing to the score, it may simply be too distant.
      3. Use transitivity through other sequences to decide whether the domain is there.
    3. Note that the alignment of Honey bee GSTD1 and SSPA_ECO57 includes portions of both the N-terminal and C-terminal domains, but neither domain is completely aligned. Why do you think the alignments do not include the complete domains?
    4. Is your explanation for the partial domain alignment consistent with the argument that domains have a characteristic length? How might you test whether a complete domain is present?
    5. The FASTA programs can partition an alignment score based on domain boundaries, and report the amount of score associated with a domain. In the SSPA_ECO57 alignment, how much of the score comes from the GST-N terminal domain? The GST-C terminal domain? Does this alignment provide strong support for the presence of a GST-C terminal domain? How might you test for the presence of the domain on GSTD1? On SSPA_ECO57? In the sub-alignment scores, the Q value is -10 * log10(p) for the sub-alignment score, so Q=30.0 corresponds to p = 0.001.
  4. Repeat the GSTD1 search [pgm] using the BLASTP62/-11/-1 scoring matrix that BLAST uses. Re-examine the GSTD1:SSPA_ECO57 alignment. Are both Glutathione transferase domains present? Look at the alignments to the homologs above and below SSPA_ECO57. Based on those alignments, do you think the Glutathione-S-Trfase C-like domain is really missing? Why did the alignment become so much shorter?
    1. SKIP
  5. One of the candidate non-homologs is sp|Q9SI20|EF1D2_ARATH, with an E()-value of 0.11.
    1. Does the domain structure of EF1D2_ARATH suggest that it could be a glutathione transferase homolog?
      1. No, it looks very different
      2. Only residues 8-60 overlapped
    2. Use the General re-search to explore the domains contained in EF1D2_ARATH homologs found in SwissProt.
    3. Does this secondary search support homology or non-homology?

      1. Focusing on the NODOM: it simply isn't annotated! Scrolling down, you can see that a glutathione transferase homolog eventually does show up in the annotation!



Trust annotations when they say something, but don't trust them if they don't say something. 
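
The two small calculations from part 1 (the gap penalties and the Q-value of a sub-alignment) can be sketched in a few lines of Python. This is only an illustration: the split of the -12 one-residue penalty into -10 to open and -2 per residue is an assumption that is consistent with the -12/-14 values above, and the function names are made up for this note.

```python
import math

# Assumed split of the penalties seen in the output: -12 for one residue and
# -14 for two is consistent with -10 to open a gap plus -2 per gapped residue.
GAP_OPEN = -10
GAP_EXTEND = -2


def gap_penalty(length, gap_open=GAP_OPEN, gap_extend=GAP_EXTEND):
    """Affine gap penalty: pay the open cost once, plus a cost per residue."""
    if length < 1:
        raise ValueError("gap length must be at least 1")
    return gap_open + gap_extend * length


def q_to_p(q):
    """Invert Q = -10 * log10(p) to recover the probability."""
    return 10 ** (-q / 10)


def p_to_q(p):
    """Q value of a sub-alignment score with probability p."""
    return -10 * math.log10(p)


print(gap_penalty(1), gap_penalty(2))  # -12 -14
print(q_to_p(30.0))
```

Because Q is on a log scale, every 10 units of Q is another factor of 10 in p, so a region with Q = 10 (p = 0.1) says very little about whether a domain is present.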

2. Exploring domains and alignment over-extension -- cortactin (SRC8_HUMAN)
Compare SRC8_HUMAN [pgm] (human cortactin) to the SwissProt protein sequence database.

  1. Looking at the top five alignments, how many cortactin orthologs do you see? (ortholog, same protein, different species).
    1. Four orthologs
  2. In the SRC8 HUMAN:CHICK alignment, both the query and the subject (library) sequences align seven cortactin domains and an SH3 domain. In addition, two regions (one before the cortactin domain cluster and one after) are well conserved, but do not have annotated domains (NODOM). Are these non-domain (NODOM) regions as well conserved as the annotated domains?

  3. Look at the SRC8_HUMAN:HCLS1_MOUSE alignment. How many cortactin domains does HCLS1_MOUSE contain? How much score does the NODOM spanning the region between cortactin domains and the SH3 domain contribute? Why is it included in the alignment? Is it likely to be homologous?
    1. It is a homolog, but not an ortholog; it is a paralog! A mouse sequence was already found among the orthologs, and only 4 domains are shared.
  4. Is the NODOM between the cortactin domains and the SH3 domain likely to be homologous in the SRC8_HUMAN:DBNLB_XENLA alignment?
    1. No. The second, unannotated region scores poorly (even negatively!). It still has to be included for the alignment to reach the SH3 domain; without the SH3 domain, the alignment would stop.
  5. In the SRC8_HUMAN:LASP1_HUMAN alignment, the alignment extends to include several Nebulin_repeat domains. Do you think there is a Nebulin_repeat domain in SRC8_HUMAN? Why do you think those domains are aligning?
  6. What scoring matrix should be used to reduce over-extension from the SH3 domain?
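
The over-extension seen above can be illustrated with a toy calculation. The sketch below treats each region (domain or NODOM) as a single score and finds the contiguous run of regions with the highest total. That is a simplification, since a real local alignment works residue by residue, and the region scores are invented numbers, not values from the SRC8_HUMAN searches.

```python
def best_local_span(region_scores):
    """Return (total, start, end) for the best contiguous run of regions.

    Maximum-subarray scan: a running total is dropped and restarted
    whenever it falls to zero or below, as in local alignment scoring.
    """
    best_total, best_start, best_end = 0, 0, 0
    running, start = 0, 0
    for i, score in enumerate(region_scores):
        if running <= 0:
            running, start = score, i   # start a new candidate span here
        else:
            running += score            # extend the current span
        if running > best_total:
            best_total, best_start, best_end = running, start, i + 1
    return best_total, best_start, best_end


# Hypothetical scores: cortactin repeats, a NODOM linker, an SH3 domain.
print(best_local_span([150, -12, 90]))   # (228, 0, 3): the linker is carried along
print(best_local_span([150, -12]))       # (150, 0, 1): no SH3, alignment stops
print(best_local_span([150, -120, 90]))  # (150, 0, 1): linker too costly to cross
```

The weakly negative region is included only because the positive score beyond it more than pays for it. Being inside the alignment is therefore not evidence that the region itself is homologous.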

3. Exploring domains and over-extension with local alignments -- death associated protein kinase (DAPK1_HUMAN)
  1. Look up the domain structure of DAPK1_HUMAN at Pfam [pgm].
    1. What are the major (PfamA) domain regions on the protein?
    2. Which of the domains is repeated?
    3. In a local (LALIGN) alignment, where would you expect to see overlapping domains like those in Calmodulin (CALM_HUMAN) and Cortactin (SRC8_HUMAN)?
  2. Use lalign/plalign [pgm] to examine local similarities between DAPK1_HUMAN and itself. Check the options to "annotate sequence 1 domains" and "annotate sequence 2 domains". Do you see the domains you expected from Pfam? Do they map in the same places?
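
To get a feel for what a self-comparison shows when a domain is repeated, here is a crude stand-in for the lalign/plalign plot: it counts exact k-mer matches between a sequence and itself, grouped by diagonal offset. This is not how lalign works (lalign computes scored local alignments), and the toy sequence is invented rather than taken from DAPK1_HUMAN.

```python
from collections import Counter


def repeat_offsets(seq, k=3):
    """Count exact k-mer matches of seq against itself, by diagonal offset.

    An offset d with many matches means the sequence resembles itself
    shifted by d residues, which is what a repeated domain produces.
    """
    positions = {}       # k-mer -> earlier start positions
    offsets = Counter()  # offset -> number of matching k-mers
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        for j in positions.get(word, []):
            offsets[i - j] += 1
        positions.setdefault(word, []).append(i)
    return offsets


# Toy sequence: a 10-residue unit repeated three times between unique flanks.
toy = "MKT" + "ACDEFGHIKL" * 3 + "WYV"
print(repeat_offsets(toy).most_common(2))  # [(10, 18), (20, 8)]
```

The matches pile up on off-diagonals at multiples of the repeat length. In the plalign plot, repeated domains likewise appear as parallel off-diagonal alignments, spaced by the distance between copies.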

