Difference between revisions of "HMM and alignment"

Revision as of 19:03, 11 February 2013

This is a collage of stuff from papers and the HHpred website.

The name ‘hidden Markov model’ comes from the fact that the state sequence is a first-order Markov chain, but only the symbol sequence is directly observed.
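An HMM can be trained from initially unaligned (unlabeled) sequences by iterative parameter estimation. Alternatively, an HMM can be built from prealigned (pre-labeled) sequences (i.e. where the state paths are assumed to be known).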
In the latter case, the parameter estimation problem is simply a matter of converting observed counts of symbol emissions and state transitions into probabilities.
In building a profile HMM, an existing multiple alignment is given as input.
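
The counting step for pre-labeled data can be sketched as follows. This is a minimal illustration, assuming each training sequence is given as a list of (state, symbol) pairs; the function name `estimate_from_labeled` and the input layout are made up for this note and do not come from any HMM package.

```python
from collections import Counter, defaultdict


def estimate_from_labeled(labeled_paths):
    """Turn observed emission and transition counts into probabilities.

    labeled_paths: list of paths, each a list of (state, symbol) pairs
    (hypothetical input layout chosen for this sketch).
    Returns (emission_probs, transition_probs) as nested dicts.
    """
    emit_counts = defaultdict(Counter)
    trans_counts = defaultdict(Counter)
    for path in labeled_paths:
        # Count which symbol each state emitted.
        for state, symbol in path:
            emit_counts[state][symbol] += 1
        # Count transitions between consecutive states on the path.
        for (src, _), (dst, _) in zip(path, path[1:]):
            trans_counts[src][dst] += 1

    def normalize(counts):
        # Divide each row of counts by its total (maximum likelihood estimate).
        probs = {}
        for key, row in counts.items():
            total = sum(row.values())
            probs[key] = {item: n / total for item, n in row.items()}
        return probs

    return normalize(emit_counts), normalize(trans_counts)


# Toy labeled data: two short paths over states "A" and "B".
toy_paths = [
    [("A", "x"), ("A", "y"), ("B", "x")],
    [("A", "x"), ("B", "y")],
]
toy_emissions, toy_transitions = estimate_from_labeled(toy_paths)
```

Plain counts give zero probability to anything not seen in the training data, so practical tools usually add pseudocounts or priors before normalizing.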

It pays to build HMMs on pre-aligned data whenever possible.
Especially for complicated HMMs, the parameter space may be complex, with many spurious local optima that can trap a training algorithm.

Profile HMMs are similar to simple sequence profiles, but in addition to the amino acid frequencies in the columns of a multiple sequence alignment they contain the position-specific probabilities for inserts and deletions along the alignment.
For each consensus column of the multiple alignment, a ‘match’ state models the distribution of residues allowed in the column.
An ‘insert’ state and ‘delete’ state at each column allow for insertion of one or more residues between that column and the next, or for deleting the consensus residue.
Profile HMMs are strongly linear, left-to-right models, unlike the general HMM case.
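
The match/insert/delete layout can be written down as a small transition table. This is a sketch under assumptions: it uses the textbook architecture in which every state at position k may go to the insert state at k or to the match and delete states at k+1 (some implementations, such as HMMER's Plan 7, leave out the insert-delete transitions), and the state naming with `BEGIN` and `END` is chosen only for this note.

```python
def profile_hmm_transitions(n_columns):
    """Allowed transitions of a textbook-style profile HMM.

    States are tuples (kind, position): ("M", k), ("I", k), ("D", k),
    plus ("BEGIN", 0) and ("END", n_columns + 1).
    Returns a dict mapping each state to the list of states it can reach.
    """
    edges = {}
    for k in range(n_columns + 1):
        # From position k one can stay in the insert state I_k ...
        targets = [("I", k)]
        if k < n_columns:
            # ... or move on to the next consensus column (match or delete).
            targets += [("M", k + 1), ("D", k + 1)]
        else:
            targets += [("END", n_columns + 1)]
        sources = [("M", k) if k > 0 else ("BEGIN", 0), ("I", k)]
        if k > 0:
            sources.append(("D", k))
        for src in sources:
            edges[src] = list(targets)
    return edges


edges = profile_hmm_transitions(3)
# Left-to-right property: no transition ever goes back to an earlier position.
is_left_to_right = all(
    dst[1] >= src[1] for src, dsts in edges.items() for dst in dsts
)
```

The only cycles are the insert self-loops, which is what allows one or more residues to be inserted between two consensus columns.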

The probability parameters in a profile HMM are usually converted to additive log-odds scores before aligning and scoring a query sequence (Barrett et al., 1997).
The scores for aligning a residue to a profile match state are therefore comparable to the derivation of BLAST or FASTA scores: if the probability of the match state emitting residue x is p_x, and the expected background frequency of residue x in the sequence database is f_x, the score for residue x at this match state is log(p_x / f_x).
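
A small sketch of the log-odds conversion for one match state follows. The function names, the base-2 logarithm, and the two-letter toy alphabet are assumptions made for illustration; the probabilities are assumed to be strictly positive (for example after pseudocounts).

```python
import math


def log_odds_scores(match_probs, background, base=2.0):
    """Score for each residue x at one match state: log(p_x / f_x).

    match_probs: dict residue -> emission probability p_x (must be > 0).
    background:  dict residue -> background frequency f_x (must be > 0).
    """
    return {
        x: math.log(p / background[x], base) for x, p in match_probs.items()
    }


def score_sequence(residues, column_scores):
    """Add up the per-column scores for residues aligned to match states."""
    return sum(col[r] for r, col in zip(residues, column_scores))


# Toy example: "A" is enriched at this position, "C" is depleted.
toy_scores = log_odds_scores({"A": 0.5, "C": 0.125}, {"A": 0.25, "C": 0.25})
toy_total = score_sequence("AC", [toy_scores, toy_scores])
```

Because the scores are logarithms, adding them along an alignment corresponds to multiplying the underlying probability ratios.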

The logarithms of the position-specific insert and delete probabilities are in fact equivalent to position-specific gap penalties (Durbin et al., 1998).
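
To make the link to gap penalties concrete, the sketch below computes the cost of skipping some consensus columns as the negative log of the transition probabilities used. It is a simplification for illustration: the function name is made up, and a single set of transition probabilities is used for the whole deletion, whereas in a real profile HMM each position has its own values.

```python
import math


def deletion_cost(t_md, t_dd, t_dm, length):
    """Negative log-probability of deleting `length` consecutive columns.

    t_md: probability of match -> delete (acts like a gap-open penalty).
    t_dd: probability of delete -> delete (acts like a gap-extend penalty).
    t_dm: probability of delete -> match (closing the gap).
    """
    if length < 1:
        raise ValueError("length must be at least 1")
    log_prob = math.log(t_md) + (length - 1) * math.log(t_dd) + math.log(t_dm)
    return -log_prob


# A position where deletions are rare is expensive to skip,
# a position where deletions are common is cheap to skip.
cost_conserved = deletion_cost(t_md=0.01, t_dd=0.5, t_dm=0.5, length=1)
cost_variable = deletion_cost(t_md=0.10, t_dd=0.5, t_dm=0.5, length=1)
```

The cost grows linearly with the gap length, as with an affine gap penalty, but the open and extend terms can differ from column to column.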

In contrast to parameter estimation, a suitable HMM architecture (the number of states, and how they are connected by state transitions) must usually be designed by hand.
A maximum likelihood architecture construction algorithm exists for the special case of building profile HMMs from multiple alignments (Durbin et al., 1998).
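
The maximum likelihood construction itself is not reproduced here. As a much simpler stand-in, the sketch below uses the common heuristic of treating an alignment column as a consensus (match) column when its gap fraction does not exceed a threshold; the function name, the 0.5 default and the gap characters are assumptions for illustration only.

```python
def mark_match_columns(alignment, max_gap_fraction=0.5, gap_chars="-."):
    """Heuristic architecture choice (not the maximum likelihood algorithm).

    alignment: list of equal-length aligned sequences (strings).
    Returns one boolean per column: True if the column becomes a match state.
    """
    if not alignment:
        raise ValueError("alignment is empty")
    n_cols = len(alignment[0])
    if any(len(row) != n_cols for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    marks = []
    for c in range(n_cols):
        # Fraction of sequences with a gap character in this column.
        gaps = sum(row[c] in gap_chars for row in alignment)
        marks.append(gaps / len(alignment) <= max_gap_fraction)
    return marks


toy_alignment = ["AC-G", "A--G", "ACTG", "A--G"]
toy_marks = mark_match_columns(toy_alignment)
n_match_states = sum(toy_marks)
```

Columns that are not marked are then modeled by the insert state of the preceding match column.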


Insertions tend to be seen most often in surface loops of protein structures, and so have a bias towards hydrophilic residues.
Profile HMMs can capture this information in the insert state emission distributions.

The states of the HMM are often associated with meaningful biological labels, such as ‘structural position 42’. In our toy HMM, for instance, states 1 and 2 correspond to a biological notion of two sequence regions with differing residue composition.
Inferring the alignment of the observed protein or DNA sequence to the hidden state sequence is like labeling the sequence with relevant biological information.
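
Labeling a sequence with its most probable state path is usually done with the Viterbi algorithm. The sketch below runs it on a two-state toy model in which state "1" prefers A/T and state "2" prefers G/C; all parameter values are invented for illustration and are not taken from the source papers.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable state path for `obs`, computed in log space.

    All probabilities used must be strictly positive.
    Returns (path, log_probability_of_path).
    """
    # Initialisation: start in each state and emit the first symbol.
    scores = [
        {s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}
    ]
    backpointers = []
    for symbol in obs[1:]:
        column, pointers = {}, {}
        for s in states:
            # Best predecessor state for ending in state s at this position.
            prev = max(
                states, key=lambda r: scores[-1][r] + math.log(trans_p[r][s])
            )
            column[s] = (
                scores[-1][prev]
                + math.log(trans_p[prev][s])
                + math.log(emit_p[s][symbol])
            )
            pointers[s] = prev
        scores.append(column)
        backpointers.append(pointers)
    # Traceback from the best final state.
    last = max(states, key=lambda s: scores[-1][s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, scores[-1][last]


toy_states = ["1", "2"]
toy_start = {"1": 0.5, "2": 0.5}
toy_trans = {"1": {"1": 0.9, "2": 0.1}, "2": {"1": 0.1, "2": 0.9}}
toy_emit = {
    "1": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},  # AT-rich region
    "2": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4},  # GC-rich region
}
toy_path, toy_logp = viterbi("AATTGGCC", toy_states, toy_start, toy_trans, toy_emit)
```

The returned path assigns one state label to every residue, which is the "labeling" described above.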

The HMM-HMM alignment algorithm maximizes a weighted form of coemission probability, the probability that the two HMMs will emit the same sequence of residues.
Amino acids are weighted according to their abundance, with rare coemitted amino acids contributing more to the alignment score.
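
One way to write such an abundance-weighted coemission score for a pair of aligned columns is the log-sum-of-odds form log2(sum over a of q(a) t(a) / f(a)), which is how I understand the column score described for HHsearch; treat the exact form as an assumption. The function name and the two-letter toy alphabet below are illustrative only.

```python
import math


def column_score(q, t, background):
    """Weighted coemission score of two profile columns.

    q, t:       dict residue -> emission probability in each HMM column.
    background: dict residue -> background frequency f(a), all > 0.
    Dividing by f(a) makes coemission of rare residues count more.
    """
    return math.log2(sum(q[a] * t[a] / background[a] for a in background))


toy_background = {"common": 0.9, "rare": 0.1}
both_rare = {"common": 0.0, "rare": 1.0}
both_common = {"common": 1.0, "rare": 0.0}

score_rare = column_score(both_rare, both_rare, toy_background)
score_common = column_score(both_common, both_common, toy_background)
score_null = column_score(toy_background, toy_background, toy_background)
```

Two columns that both emit the rare residue score much higher than two columns that both emit the common one, and two columns that simply follow the background score about zero.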

Secondary structure can be included in the HMM-HMM comparison.
We score pairs of aligned secondary structure states in a way analogous to the classical amino acid substitution matrices.
We use ten different substitution matrices that we derived from a statistical analysis of the structure database, one for each confidence value given by PSIPRED.

Edit by Devicerandom (→ How it works inside): [[Category:Bioinfo notes]] changed to [[Category:Bioinfo notes|Hidden Markov models]]