HMM and alignment

This is a collage of stuff from papers and HHpred website

The name ‘hidden Markov model’ comes from the fact that the state sequence is a first-order Markov chain, but only the symbol sequence is directly observed.
Alternatively, an HMM can be built from prealigned (pre-labeled) sequences (i.e. where the state paths are assumed to be known).
In the latter case, the parameter estimation problem is simply a matter of converting observed counts of symbol emissions and state transitions into probabilities.
In building a profile HMM, an existing multiple alignment is given as input.

it pays to build HMMs on pre-aligned data whenever possible.
Especially for complicated HMMs, the parameter space may be complex, with many spurious local optima that can trap a training algorithm.

Profile HMMs are similar to simple sequence profiles, but in addition to the amino acid frequencies in the columns of a multiple sequence alignment they contain the position-specific probabilities for inserts and deletions along the alignment
For each consensus column of the multiple alignment, a ‘match’ state models the distribution of residues allowed in the column.
An ‘insert’ state and ‘delete’ state at each column allow for insertion of one or more residues between that column and the next, or for deleting the consensus residue.
Profile HMMs are strongly linear, left right models, unlike the general HMM case.

The logarithms of these probabilities are in fact equivalent to position-specific gap penalties (Durbin et al., 1998).

In contrast to parameter estimation, a suitable HMM architecture (the number of states, and how they are connected by state transitions) must usually be designed by hand.
A maximum likelihood architecture construction algorithm exists for the special case of building profile HMMs from multiple alignments (Durbin et al., 1998).

|

The states of the HMM are often associated with meaningful biological labels, such as ‘structural position 42’. In our toy HMM, for instance, states 1 and 2 correspond to a biological notion of two sequence regions with differing residue composition.
Inferring the alignment of the observed protein or DNA sequence to the hidden state sequence is like labeling the sequence with relevant biological information.

The alignment algorithm maximizes a weighted form of coemission probability, the probability that the two HMMs will emit the same sequence of residues.
Amino acids are weighted according to their abundance, rare coemitted amino acids contributing more to the alignment score.

Secondary structure can be included in the HMM-HMM comparison.
We score pairs of aligned secondary structure states in a way analogous to the classical amino acids substitution matrices.
We use ten different substitution matrices that we derived from a statistical analysis of the structure database, one for each confidence value given by PSIPRED.