A Boltzmann machine is an extension of the Hopfield network that uses stochastic binary neurons. These neurons are arranged into visible and hidden units, and the stochasticity means that each neuron switches into the ON or OFF state probabilistically rather than deterministically.
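To make that firing rule concrete, here is a minimal sketch (my own illustration, not something from the lecture notes) of a single stochastic binary neuron that turns ON with probability given by the sigmoid of its total input:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_neuron(weights, neighbour_states, bias, rng):
    """Stochastic binary neuron: ON (1) with probability sigmoid(total input)."""
    total_input = np.dot(weights, neighbour_states) + bias
    p_on = sigmoid(total_input)
    return 1 if rng.random() < p_on else 0

# e.g. sample_neuron(np.array([0.5, -1.0]), np.array([1, 1]), 0.1, np.random.default_rng(0))
```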
Here's a more intuitive explanation with an example. Boltzmann machines are mainly intended for associative memory. Imagine you bump into one of your classmates after a decade and try to recall the friend's name; you are likely to do this by relating to the GCSE or bachelors class, i.e. you do an associative memory recall. I believe Boltzmann machines were designed from this thought.
Once the Boltzmann network is modelled, when you present a new data item at the visible layer it tries to recognise the pattern by triggering the appropriate hidden neurons. In order to model the network, or the associations, we need to learn a joint probability distribution over the hidden and visible neurons. But learning a joint distribution is not easy, and it is much more convenient to learn conditional distributions by assuming some independence. This led to Restricted Boltzmann Machines, where the restriction is the absence of connections within a layer: hidden neurons are not connected to each other (and neither are visible neurons), so each layer is conditionally independent given the other.
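In an RBM those conditional distributions factorise neatly: given the visible layer, each hidden unit switches ON with a sigmoid probability, and the same holds the other way around. Here is a hedged sketch of the two sampling steps (my own naming; I assume a weight matrix W of shape visible × hidden and a bias vector for each layer):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_hidden(v, W, hid_bias, rng):
    """p(h_j = 1 | v) = sigmoid(hid_bias_j + sum_i v_i W_ij); hidden units are independent given v."""
    p_h = sigmoid(hid_bias + v @ W)
    return (rng.random(p_h.shape) < p_h).astype(float)

def sample_visible(h, W, vis_bias, rng):
    """p(v_i = 1 | h) = sigmoid(vis_bias_i + sum_j W_ij h_j); visible units are independent given h."""
    p_v = sigmoid(vis_bias + h @ W.T)
    return (rng.random(p_v.shape) < p_v).astype(float)
```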
So how do we learn the joint distribution? Trying to learn it analytically is a tedious task: the computation involves learning the weights within and between the hidden and visible layers, and it has to be done for every dimension of every training example; worse, normalising the joint distribution means summing over every possible configuration of the units. In other words, we have an intractable problem.
Indeed, what we are trying to learn here is a Boltzmann distribution. It has the nice property that if we start sampling from anywhere in the distribution and keep doing so, the samples will come to approximate the distribution, and we eventually end up learning the statistical properties of the underlying data.
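For reference, the Boltzmann distribution being learnt assigns each joint configuration of visible and hidden units a probability that falls off exponentially with an energy; the sketch below shows the standard RBM form of that energy (variable names are my own):

```python
import numpy as np

def energy(v, h, W, vis_bias, hid_bias):
    """Standard RBM energy: E(v, h) = -vis_bias.v - hid_bias.h - v.W.h"""
    return -(vis_bias @ v) - (hid_bias @ h) - (v @ W @ h)

# The Boltzmann distribution sets p(v, h) proportional to exp(-E(v, h));
# normalising it means summing exp(-E) over every configuration, which is
# exactly the intractable part.
def unnormalised_prob(v, h, W, vis_bias, hid_bias):
    return np.exp(-energy(v, h, W, vis_bias, hid_bias))
```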
We also try to simplify the problem by learning one of the variables at a time, sampling it conditioned on the other. The idea of contrastive divergence is that we learn a simpler point estimate of one variable (say, visible or hidden) as a route to learning the other, and thus approximate the larger Boltzmann distribution. Contrastive divergence may be seen as a workaround that eventually came to be accepted as a solution. All these ideas are implemented with Gibbs sampling (i.e. sample hidden given visible, then visible given hidden) to learn the entire distribution.
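Putting Gibbs sampling and contrastive divergence together gives the familiar CD-1 recipe: one up-down-up pass per training vector followed by a small weight update. The sketch below is only illustrative and assumes a binary RBM; the function name and learning rate are my own choices:

```python
import numpy as np

def cd1_update(v0, W, vis_bias, hid_bias, lr, rng):
    """One contrastive-divergence (CD-1) step for a binary RBM on one training vector v0."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

    # Positive phase: clamp the data and sample the hidden layer.
    p_h0 = sigmoid(hid_bias + v0 @ W)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)

    # Negative phase: one Gibbs step down to the visible layer and back up.
    p_v1 = sigmoid(vis_bias + h0 @ W.T)
    v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
    p_h1 = sigmoid(hid_bias + v1 @ W)

    # Move the weights towards the data statistics and away from the reconstruction's.
    W += lr * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
    vis_bias += lr * (v0 - v1)
    hid_bias += lr * (p_h0 - p_h1)
    return W, vis_bias, hid_bias
```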
We start by clamping (in simple words, providing) the training data to the visible layer; this fires the hidden neurons and the hidden layer learns the features. But as the network learns the features, we would like to know what the model believes in. This is done by generating data (or fantasies) from the model. This is called a free run, as we do not keep any restrictions in place (like the clamping we did for the visible layer). We can then compare the network's output with the training data to validate the network's belief (e.g. the comparison of the digit 2 in the lecture notes). In a nutshell, our intention is to minimise the Divergence in the Contrast (i.e. the observed differences) between the fantasies and the training data, and this is called minimising the Contrastive Divergence.
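A free run can be sketched in the same spirit: start the visible layer at random, alternate Gibbs steps with nothing clamped, and read off the resulting fantasy to compare against the training data (again an illustrative sketch with my own naming):

```python
import numpy as np

def generate_fantasy(W, vis_bias, hid_bias, n_steps, rng):
    """Free run: unclamped alternating Gibbs sampling; the final visible state is a fantasy."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    v = (rng.random(vis_bias.shape) < 0.5).astype(float)  # random initial visible state
    for _ in range(n_steps):
        p_h = sigmoid(hid_bias + v @ W)
        h = (rng.random(p_h.shape) < p_h).astype(float)
        p_v = sigmoid(vis_bias + h @ W.T)
        v = (rng.random(p_v.shape) < p_v).astype(float)
    return v
```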
Lastly, what is the Markov Chain Monte Carlo bit? In doing all of this, we make a major assumption, the Markov property: the next state depends only on the current state. This is important while learning the distribution using every dimension of the training data. The Markov dependence is relevant between these spatially or temporally dependent dimensions of the input vector, since learning proceeds by keeping all but one dimension of the input vector fixed; after learning on one dimension, the next one is altered, and so on. More importantly, we make use of the ergodic property of the Markov chain (its transition probabilities are positive and it does not get stuck in cycles), which is what lets Gibbs sampling keep exploring the state space rather than looping while learning the distribution. The random sampling from the distribution is the Monte Carlo part, and it is what ensures we are indeed approximating the intended Boltzmann distribution.