
After completing 12 years in software QA with a variety of test data, I was tempted to make a career shift into data science and decided to pursue this through a structured masters program. Though I love the three pillars - math, statistics and programming - I did not have an easy start, as I am returning to studies after a long gap of 14 years. As I began learning machine learning, visual analytics, data science, Python, Matlab, R, Tableau, Mondrian etc., I got excited about blogging as a way to summarise my learning. I will try to post frequently and keep things simple. Looking forward to a good time learning and sharing... Cheers, Mej!

Boltzmann Machine and Contrastive Divergence

A Boltzmann machine is an extension of the Hopfield network that uses stochastic binary neurons. These neurons are arranged in visible and hidden layers, and the stochasticity means that each neuron switches between its ON and OFF states probabilistically rather than deterministically.
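To make the stochastic part concrete, here is a minimal sketch in Python/NumPy (the weights, inputs and bias are made-up numbers, not from any particular model): the neuron computes its probability of being ON from its weighted input and then flips a biased coin.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical weighted input to one neuron: weights . inputs + bias
total_input = np.dot([0.5, -1.2, 0.8], [1, 0, 1]) + 0.1

p_on = sigmoid(total_input)   # probability of the ON state
state = rng.random() < p_on   # stochastic: ON with probability p_on
print(p_on, int(state))
```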

Here’s a simpler explanation with an example. Boltzmann machines are mainly used for associative memory. Imagine you bump into one of your classmates after a decade and try to recall the friend’s name; you are likely to do this by relating them to your GCSE or bachelor's class, i.e. you perform an associative memory recall. I believe Boltzmann machines were designed from this thought.

After the Boltzmann network is modelled, presenting a new data item at the visible layer makes the network try to recognise the pattern by triggering the appropriate hidden neurons. In order to model the network, or the associations, we need to learn a joint probability distribution over the hidden and visible neurons. But learning a joint distribution is not easy, and it is much more convenient to learn conditional distributions by assuming some independence. This led to Restricted Boltzmann Machines (RBMs), where the restriction is the absence of connections within a layer: given the visible units, the hidden units are independent of one another, and vice versa.
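Here is a rough sketch of what that independence buys us, with hypothetical helper names (sample_hidden, sample_visible) and randomly initialised weights and biases: because each hidden unit is conditionally independent given the visible layer (and vice versa), the whole layer can be sampled in one vectorised step.

```python
import numpy as np

rng = np.random.default_rng(1)
n_visible, n_hidden = 6, 3

W = rng.normal(0, 0.1, size=(n_visible, n_hidden))  # visible-hidden weights
b = np.zeros(n_visible)                             # visible biases
c = np.zeros(n_hidden)                              # hidden biases

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_hidden(v):
    # p(h_j = 1 | v): each hidden unit is independent given v
    p = sigmoid(c + v @ W)
    return (rng.random(n_hidden) < p).astype(float), p

def sample_visible(h):
    # p(v_i = 1 | h): each visible unit is independent given h
    p = sigmoid(b + W @ h)
    return (rng.random(n_visible) < p).astype(float), p

v = np.array([1, 0, 1, 1, 0, 0], dtype=float)
h, p_h = sample_hidden(v)
print(h, p_h)
```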

So how do we learn the joint distribution? Learning it analytically is a tedious task, as the computation involves learning the weights within and between the hidden and visible layers, and normalising the distribution requires summing over every possible joint configuration of the neurons, a number that grows exponentially with the number of units. And this needs to be done for each dimension of every training item; in other words, we have an intractable problem.
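To see the blow-up concretely, here is a toy brute-force computation of that normalising constant (the partition function) using the standard RBM energy function; the layer sizes and random weights are placeholders, kept deliberately tiny since the number of joint configurations doubles with every extra neuron.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
n_visible, n_hidden = 4, 3   # already 2**(4+3) = 128 joint states

W = rng.normal(0, 0.1, size=(n_visible, n_hidden))
b = np.zeros(n_visible)
c = np.zeros(n_hidden)

def energy(v, h):
    # Standard RBM energy: E(v, h) = -b.v - c.h - v^T W h
    return -(b @ v + c @ h + v @ W @ h)

# Partition function: sum exp(-E) over ALL visible/hidden configurations
Z = 0.0
for v in itertools.product([0.0, 1.0], repeat=n_visible):
    for h in itertools.product([0.0, 1.0], repeat=n_hidden):
        Z += np.exp(-energy(np.array(v), np.array(h)))

print(Z)  # feasible here, hopeless for, say, 784 visible units
```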

Indeed, we are trying to learn a Boltzmann distribution here. It has the nice property that if we start sampling from anywhere in the distribution and keep going, the samples will approximate the distribution, and we eventually learn the statistical properties of the underlying data. We also simplify the problem by learning one set of variables at a time while sampling conditioned on the other. The idea of contrastive divergence is that we learn a simpler point estimate of one set of variables (say, visible or hidden) as a route to learning the other, and thus approximate the larger Boltzmann distribution. Contrastive divergence may be seen as a workaround that eventually came to be accepted as a solution. All these wonderful ideas are implemented with Gibbs sampling (i.e. sample hidden given visible, then visible given hidden) to learn the entire distribution.
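Here is a hedged sketch of a single CD-1 update (biases omitted for brevity; the learning rate, layer sizes and random weights are placeholders, not a definitive implementation): clamp the data on the visible layer, take one Gibbs step, and nudge the weights by the difference between the data statistics and the reconstruction statistics.

```python
import numpy as np

rng = np.random.default_rng(3)
n_visible, n_hidden, lr = 6, 3, 0.1

W = rng.normal(0, 0.1, size=(n_visible, n_hidden))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0):
    # Positive phase: hidden probabilities with the data clamped on the visible layer
    ph0 = sigmoid(v0 @ W)
    h0 = (rng.random(n_hidden) < ph0).astype(float)
    # One Gibbs step: reconstruct the visibles, then recompute hidden probabilities
    pv1 = sigmoid(W @ h0)
    v1 = (rng.random(n_visible) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W)
    # Contrastive divergence: data statistics minus reconstruction statistics
    return lr * (np.outer(v0, ph0) - np.outer(v1, ph1))

v0 = np.array([1, 0, 1, 1, 0, 0], dtype=float)
W += cd1_update(v0)
```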

We start by clamping (in simple words, providing) training data to the visible layer; this fires hidden neurons, and the hidden layer learns the features. But as the network learns the features, we would like to know what the model believes. This is done by generating data (or fantasies) from the model. This is called a free run because we do not keep any restrictions (like the clamping we did for the visible layer). We can compare the network's output against the training data to validate the network's beliefs (e.g. the comparison of the digit 2 in the lecture notes). In a nutshell, our intention is to minimise the divergence in the contrast (i.e. the observed differences) between the fantasy and training data, and this is called minimising the contrastive divergence.
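A small sketch of such a free run, assuming the random W here stands in for the weights of an already trained model: start the visible layer from noise and alternate the two Gibbs sampling steps with nothing clamped; the final visible states are the fantasies we compare against the training data.

```python
import numpy as np

rng = np.random.default_rng(4)
n_visible, n_hidden = 6, 3
W = rng.normal(0, 0.1, size=(n_visible, n_hidden))  # placeholder for trained weights

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fantasy(n_steps=100):
    # Free run: start from noise, alternate sampling h|v and v|h, nothing clamped
    v = (rng.random(n_visible) < 0.5).astype(float)
    for _ in range(n_steps):
        h = (rng.random(n_hidden) < sigmoid(v @ W)).astype(float)
        v = (rng.random(n_visible) < sigmoid(W @ h)).astype(float)
    return v

print(fantasy())  # compare these samples against training data by eye or by a metric
```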


Lastly, what is the Markov Chain Monte Carlo bit? In doing all of this, we make a major assumption, the Markov property: the next state depends only on the current state. This is important while learning the distribution using every dimension of the training data. Markov dependence is relevant between these spatially or temporally dependent dimensions of the input vector, because learning is done by keeping all but one dimension of the input vector constant; after learning on one dimension, the next one is altered, and this continues. More importantly, we make use of the ergodicity of the Markov chain (i.e. every state can be reached from every other, and the chain does not get trapped in cycles). This property encourages further exploration and avoids looping while learning the distribution with Gibbs sampling. The random sampling from the distribution is the Monte Carlo part, which ensures we are indeed approximating the intended Boltzmann distribution.
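As a toy illustration of why ergodicity matters (a hypothetical two-state chain, not an RBM): an ergodic chain converges to the same long-run state frequencies no matter where it starts, which is exactly what lets Gibbs sampling approximate the Boltzmann distribution from an arbitrary initialisation.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical ergodic transition matrix: every transition has positive probability
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])

def run_chain(start, n_steps=50_000):
    # Count how often each state is visited over a long run
    state, visits = start, np.zeros(2)
    for _ in range(n_steps):
        state = rng.choice(2, p=P[state])
        visits[state] += 1
    return visits / n_steps

# Both starting states converge to the same frequencies (~[0.571, 0.429])
print(run_chain(0), run_chain(1))
```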
