HMM and alignment

This is a collage of notes from papers and the HHpred website.

Building a profile HMM

The name ‘hidden Markov model’ comes from the fact that the state sequence is a first-order Markov chain, but only the symbol sequence is directly observed.
An HMM can be trained from initially unaligned (unlabeled) sequences. Alternatively, it can be built from prealigned (pre-labeled) sequences, i.e. sequences whose state paths are assumed to be known.
In the latter case, parameter estimation is simply a matter of converting observed counts of symbol emissions and state transitions into probabilities.
In building a profile HMM, an existing multiple alignment is given as input.
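
As a rough illustration of turning counts into probabilities, the sketch below estimates match-state emission probabilities from a tiny prealigned set. It rests on assumptions that are not in the notes above: every alignment column is treated as a match column, gaps are written as "-", and a flat pseudocount of 1 is added so unseen residues keep a nonzero probability. The function name and the toy alignment are hypothetical.

```python
from collections import Counter

def column_emission_probs(alignment, alphabet="ACDEFGHIKLMNPQRSTVWY", pseudocount=1.0):
    """Convert per-column residue counts into emission probabilities.

    Assumes all sequences have equal length and that every column is a
    match column (a simplification chosen for this sketch).
    """
    n_cols = len(alignment[0])
    probs = []
    for col in range(n_cols):
        # Count observed residues in this column, ignoring gap characters.
        counts = Counter(seq[col] for seq in alignment if seq[col] != "-")
        total = sum(counts[a] for a in alphabet) + pseudocount * len(alphabet)
        probs.append({a: (counts[a] + pseudocount) / total for a in alphabet})
    return probs

# Hypothetical three-sequence alignment with one gap.
toy_alignment = ["ACD", "ACE", "A-E"]
emissions = column_emission_probs(toy_alignment)
```

Transition probabilities could be estimated the same way, by counting match, insert and delete transitions along the known state paths and normalizing.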

It pays to build HMMs on prealigned data whenever possible.
Especially for complicated HMMs, the parameter space may be complex, with many spurious local optima that can trap a training algorithm.

How it works inside

Profile HMMs are similar to simple sequence profiles, but in addition to the amino acid frequencies in the columns of a multiple sequence alignment, they contain position-specific probabilities for inserts and deletions along the alignment.
For each consensus column of the multiple alignment, a ‘match’ state models the distribution of residues allowed in the column.
An ‘insert’ state and ‘delete’ state at each column allow for insertion of one or more residues between that column and the next, or for deleting the consensus residue.
Profile HMMs are strongly linear, left-to-right models, unlike the general HMM case.
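
The left-to-right structure can be made concrete by listing which state transitions are allowed. The sketch below is only an assumption about the wiring: it uses the fully connected match/insert/delete layout (including insert-to-delete and delete-to-insert transitions, which some implementations omit) and leaves out begin and end states. The function name is hypothetical.

```python
def profile_hmm_transitions(n_columns):
    """List allowed (source, target) state pairs for a toy profile HMM.

    States are named (kind, k), with kind in {"M", "I", "D"} and k the
    1-based consensus column. Begin and end states are omitted.
    """
    transitions = []
    for k in range(1, n_columns + 1):
        for kind in ("M", "I", "D"):
            # Enter (or stay in) the insert state that follows column k.
            transitions.append(((kind, k), ("I", k)))
            if k < n_columns:
                # Advance to the next consensus column: match it or delete it.
                transitions.append(((kind, k), ("M", k + 1)))
                transitions.append(((kind, k), ("D", k + 1)))
    return transitions

edges = profile_hmm_transitions(3)
```

Every edge either stays at the same column (insert) or moves one column to the right, so no path can return to an earlier column.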

The probability parameters in a profile HMM are usually converted to additive log-odds scores before aligning and scoring a query sequence (Barrett et al., 1997).
The scores for aligning a residue to a profile match state are therefore comparable to the derivation of BLAST or FASTA scores: if the probability of the match state emitting residue x is p_x, and the expected background frequency of residue x in the sequence database is f_x, the score for residue x at this match state is log(p_x/f_x).
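
A minimal sketch of this log-odds score follows. The base-2 logarithm (scores in bits) and the probabilities used are assumptions chosen for illustration; the text above does not fix the base.

```python
import math

def match_log_odds(p_x, f_x):
    """Log-odds score (in bits) for aligning residue x to a match state.

    p_x: probability that the match state emits x.
    f_x: background frequency of x in the sequence database.
    """
    return math.log2(p_x / f_x)

# Hypothetical numbers: a residue with background frequency 0.05.
enriched = match_log_odds(0.40, 0.05)  # column favors x: positive score
depleted = match_log_odds(0.01, 0.05)  # column disfavors x: negative score
```

A residue that is more frequent in the column than in the background scores positive, one that is less frequent scores negative, and p_x = f_x gives a score of zero.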

The logarithms of the position-specific insert and delete transition probabilities are in fact equivalent to position-specific gap penalties (Durbin et al., 1998).

In contrast to parameter estimation, a suitable HMM architecture (the number of states, and how they are connected by state transitions) must usually be designed by hand.
A maximum likelihood architecture construction algorithm exists for the special case of building profile HMMs from multiple alignments (Durbin et al., 1998).

Insertions tend to be seen most often in surface loops of protein structures, and so have a bias towards hydrophilic residues.
Profile HMMs can capture this information in the insert state emission distributions.

Meaning of HMM states

The states of the HMM are often associated with meaningful biological labels, such as ‘structural position 42’. In our toy HMM, for instance, states 1 and 2 correspond to a biological notion of two sequence regions with differing residue composition.
Inferring the alignment of the observed protein or DNA sequence to the hidden state sequence is like labeling the sequence with relevant biological information.
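
Labeling a sequence with its most probable hidden state path is usually done with the Viterbi algorithm. The sketch below is a generic log-space Viterbi on a hypothetical two-state DNA model (state 1 is AT-rich, state 2 is GC-rich). All probabilities are invented for illustration and are not the parameters of the toy HMM referred to above.

```python
import math

def viterbi(seq, states, start, trans, emit):
    """Return the most probable state path for seq (log-space Viterbi)."""
    # score[s] = best log probability of any path that ends in state s.
    score = {s: math.log(start[s]) + math.log(emit[s][seq[0]]) for s in states}
    backpointers = []
    for symbol in seq[1:]:
        prev_score, score, pointers = score, {}, {}
        for s in states:
            # Best predecessor state for landing in s at this position.
            best_prev = max(states, key=lambda r: prev_score[r] + math.log(trans[r][s]))
            pointers[s] = best_prev
            score[s] = (prev_score[best_prev] + math.log(trans[best_prev][s])
                        + math.log(emit[s][symbol]))
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]

states = [1, 2]
start = {1: 0.5, 2: 0.5}
trans = {1: {1: 0.9, 2: 0.1}, 2: {1: 0.1, 2: 0.9}}
emit = {1: {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},
        2: {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4}}

labels = viterbi("AATTGGCC", states, start, trans, emit)
```

With these numbers the AT-rich prefix is labeled with state 1 and the GC-rich suffix with state 2, which is the kind of biological annotation the state path provides.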

Alignment of HMMs

The alignment algorithm maximizes a weighted form of coemission probability, the probability that the two HMMs will emit the same sequence of residues.
Amino acids are weighted according to their abundance, so that rare coemitted amino acids contribute more to the alignment score.
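
One way to picture this weighting is a column score of the form log sum_a q(a) p(a) / f(a), where q and p are the emission distributions of the two aligned columns and f is the background frequency. This log-sum-of-odds form is stated here as an assumption about the scoring, and the three-letter alphabet and all numbers are hypothetical.

```python
import math

def column_score(q, p, f):
    """Background-weighted coemission score (in bits) for two profile columns.

    q, p: emission probabilities of the two aligned columns.
    f:    background frequencies; dividing by f up-weights rare residues.
    """
    return math.log2(sum(q[a] * p[a] / f[a] for a in f))

# Toy alphabet where "C" is rare and "A" is common in the background.
background = {"A": 0.5, "B": 0.4, "C": 0.1}
rare_col = {"A": 0.1, "B": 0.1, "C": 0.8}
common_col = {"A": 0.8, "B": 0.1, "C": 0.1}

rare_match = column_score(rare_col, rare_col, background)
common_match = column_score(common_col, common_col, background)
```

Two columns that agree on the rare residue score higher than two columns that agree on the common one, and a column compared against the background distribution itself scores zero.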

Secondary structure can be included in the HMM-HMM comparison.
We score pairs of aligned secondary structure states in a way analogous to the classical amino acid substitution matrices.
We use ten different substitution matrices that we derived from a statistical analysis of the structure database, one for each confidence value given by PSIPRED.

How to use HHpred etc. for structure prediction

-- From "Fast and accurate automatic structure prediction with HHpred", Andrea Hildebrand, Michael Remmert, Andreas Biegert, and Johannes Soding, Proteins 2009 --

1. Build a multiple sequence alignment for the target sequence
- HHpred2 runs the buildali.pl script from the HHsearch 1.5.0 software package.
- This script performs up to eight iterative PSI-BLAST searches (note: using HHblits?) (note: refining the alignment by hand is probably mandatory...)
- Since the most common source for corrupted PSI-BLAST alignments is the inclusion of nonhomologous segments at the ends of local sequence matches, buildali.pl prunes the ends of each sequence separately if the similarity with the profile extracted after the first search iteration falls below 1/6 bit per column.

HHpred4 and HHpred5 build their target alignment by a maximum of five iterated HMM searches through a filtered version of the nr database with a maximum of 30% pairwise sequence identity (M. Remmert and J. Soding, manuscript in preparation).
In addition, they employ a preliminary version of context-specific pseudocounts to increase the sensitivity of these searches.

2. Search for homologous templates
- A profile hidden Markov model (HMM) is calculated from the target alignment using the hhmake executable with default parameters.
- Homologous templates are identified by searching through HHpred’s weekly updated PDB70 database using HHsearch, a method for pairwise comparison of HMMs.
- The PDB70 database contains HMMs for a representative subset of PDB sequences, built in the same way as the target alignments of HHpred2.

3. Re-rank the potential templates with a neural network
- HHsearch ranks database matches by the probability that the match is homologous to the target sequence. This is useful for distinguishing homologous from nonhomologous matches, but it is not the most appropriate criterion for ranking homologous templates according to the expected quality of the homology models they would yield.
- We therefore train a neural network to predict the TM-score of the homology model.
- Based on this prediction we re-rank the database matches.
- The following three features proved to be most informative:
  - the raw HHsearch score,
  - HHsearch’s secondary structure similarity score divided by target length,
  - the expected number of correctly aligned target residues divided by target length.

4. Generate sets of multiple alignments with successively lower sequence diversities for the target sequence and the templates
- Often, several templates can be detected with high probability.
- The HMMs of these templates will be very similar to each other, so HHpred's probabilities are poorly informative for choosing among them.
- We need to narrow down the diversity of the target and template alignments.
- HHpred generates 10 sets of alignments with successively lower diversity for the target sequence and for all database matches with at least 80% probability.
- For this purpose, we employ hhfilter from the HHsearch package with option -qsc.
- We remove all sequences from the multiple sequence alignments which have a Gonnet matrix score per column with the target or template sequence of less than the given threshold.
- The similarity threshold is increased in 10 steps from 0.1 to 1.0 bits per column.
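
The filtering series could be scripted roughly as below. The sketch only builds the hhfilter command lines and does not run them. The file names are hypothetical, evenly spaced thresholds are an assumption, and only the -i, -o and -qsc options are used.

```python
def hhfilter_commands(alignment_path, output_prefix):
    """Build one hhfilter command per similarity threshold (0.1 to 1.0 bits)."""
    # Assumes the ten thresholds are evenly spaced in steps of 0.1.
    thresholds = [round(0.1 * step, 1) for step in range(1, 11)]
    commands = []
    for t in thresholds:
        out_path = f"{output_prefix}_qsc{t:.1f}.a3m"
        commands.append(
            ["hhfilter", "-i", alignment_path, "-o", out_path, "-qsc", f"{t:.1f}"]
        )
    return commands

# Hypothetical input alignment for the target sequence.
commands = hhfilter_commands("target.a3m", "target")
for cmd in commands:
    print(" ".join(cmd))
```

Each command list could then be passed to subprocess.run, once for the target alignment and once for each template alignment that reaches the probability cutoff.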

5. Rank target-template alignments of various alignment diversities with neural network
- All in all, one unfiltered set (down to a probability of 10%) and 10 filtered sets (down to a probability of 80%) of target–template alignments are generated by HHpred in this way.
- For each of these alignments, we predict the expected TM-score of the resulting structural model with the neural network and rank the templates according to this score.
- This procedure has two advantages:
  - First, it allows picking the template most closely related to the target (see Fig. 1).
  - Second, it allows choosing the alignment diversity that maximizes the expected number of correctly aligned residues.
  - For example, sequences in the template multiple alignment that are more distantly related to the template than to the target will in general impair the target–template alignment quality and will be filtered out.
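
The ranking step above amounts to sorting all target-template alignments, across templates and diversity levels, by their predicted TM-score. The sketch below assumes the neural network predictions are already available. The template names, thresholds and scores are invented placeholders, not results from the paper.

```python
def rank_alignments(candidates):
    """Sort candidate alignments by predicted TM-score, best first."""
    return sorted(candidates, key=lambda c: c["predicted_tm"], reverse=True)

# Hypothetical candidates: qsc=None marks the unfiltered alignment set.
candidates = [
    {"template": "tmplA", "qsc": None, "predicted_tm": 0.61},
    {"template": "tmplA", "qsc": 0.4, "predicted_tm": 0.68},
    {"template": "tmplB", "qsc": 0.4, "predicted_tm": 0.55},
    {"template": "tmplB", "qsc": 0.9, "predicted_tm": 0.47},
]

ranked = rank_alignments(candidates)
best = ranked[0]
```

Because the same template appears at several diversity levels, the top entry selects both the template and the alignment diversity in one step.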
