Part of Advances in Neural Information Processing Systems 7 (NIPS 1994)

*Lawrence Saul, Michael Jordan*

We propose a statistical mechanical framework for the modeling of discrete time series. Maximum likelihood estimation is done via Boltzmann learning in one-dimensional networks with tied weights. We call these networks Boltzmann chains and show that they contain hidden Markov models (HMMs) as a special case. Our framework also motivates new architectures that address particular shortcomings of HMMs. We look at two such architectures: parallel chains that model feature sets with disparate time scales, and looped networks that model long-term dependencies between hidden states. For these networks, we show how to implement the Boltzmann learning rule exactly, in polynomial time, without resort to simulated or mean-field annealing. The necessary computations are done by exact decimation procedures from statistical mechanics.

1 INTRODUCTION AND SUMMARY

Statistical models of discrete time series have a wide range of applications, most notably to problems in speech recognition (Juang & Rabiner, 1991) and molecular biology (Baldi, Chauvin, Hunkapiller, & McClure, 1992). A common problem in these fields is to find a probabilistic model, and a set of model parameters, that account for sequences of observed data. Hidden Markov models (HMMs) have been particularly successful at modeling discrete time series. One reason for this is the powerful learning rule (Baum, 1972), a special case of the Expectation-Maximization (EM) procedure for maximum likelihood estimation (Dempster, Laird, & Rubin, 1977).

In this work, we develop a statistical mechanical framework for the modeling of discrete time series. The framework enables us to relate HMMs to a large family of exactly solvable models in statistical mechanics. The connection to statistical mechanics was first noticed by Sourlas (1989), who studied spin glass models of error-correcting codes. We view the estimation procedure for HMMs as a special (and particularly tractable) case of the Boltzmann learning rule (Ackley, Hinton, & Sejnowski, 1985; Byrne, 1992).

The rest of this paper is organized as follows. In Section 2, we review the modeling problem for discrete time series and establish the connection between HMMs and Boltzmann machines. In Section 3, we show how to quickly determine whether or not a particular Boltzmann machine is tractable, and if so, how to efficiently compute the correlations in the Boltzmann learning rule. Finally, in Section 4, we look at two architectures that address particular weaknesses of HMMs: the modeling of disparate time scales and long-term dependencies.

2 MODELING DISCRETE TIME SERIES

A discrete time series is a sequence of symbols {j_l}, l = 1, 2, ..., L, in which each symbol belongs to a finite, countable set, i.e. j_l ∈ {1, 2, ..., m}. Given one long sequence, or perhaps many shorter ones, the modeling task is to characterize the probability distribution from which the time series are generated.
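To make the setup concrete, here is a minimal sketch in Python; the particular sequence and alphabet size are hypothetical, chosen only for illustration:

```python
from collections import Counter

# Hypothetical time series over an alphabet of m = 3 symbols,
# written 1..m as in the text.
series = [1, 3, 2, 2, 1, 3, 3, 2]

# The modeling task is to characterize the distribution that generates
# such sequences; the crudest characterization ignores temporal order
# entirely and records only the empirical symbol frequencies.
counts = Counter(series)
freqs = {j: counts[j] / len(series) for j in sorted(counts)}
```

An HMM, by contrast, captures temporal structure in such sequences through the dynamics of its hidden states.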

2.1 HIDDEN MARKOV MODELS

A first-order Hidden Markov Model (HMM) is characterized by a set of n hidden states, an alphabet of m symbols, a transition matrix a_{ii'}, an emission matrix b_{ij}, and a prior distribution π_i over the initial hidden state. The sequence of states {i_l} and symbols {j_l} is modeled to occur with probability

P({i_l}, {j_l}) = π_{i_1} b_{i_1 j_1} ∏_{l=1}^{L-1} a_{i_l i_{l+1}} b_{i_{l+1} j_{l+1}}.    (1)

The modeling problem is to find the parameter values (a_{ii'}, b_{ij}, π_i) that maximize the likelihood of observed sequences of training data. We will elaborate on the learning rule in Section 2.3, but first let us make the connection to a well-known family of stochastic neural networks, namely Boltzmann machines.

2.2 BOLTZMANN MACHINES

Consider a Boltzmann machine with m-state visible units, n-state hidden units, tied weights, and the linear architecture shown in Figure 1. This example represents the simplest possible Boltzmann "chain," one that is essentially equivalent to a first-order HMM unfolded in time (MacKay, 1994). The transition weights A_{ii'} connect adjacent hidden units, while the emission weights B_{ij} connect each hidden unit to

Boltzmann Chains and Hidden Markov Models
