An Exposition on the Propriety of Restricted Boltzmann Machines

http://bit.ly/jsm2016-rbm


Andee Kaplan, Daniel Nordman, and Stephen Vardeman
Iowa State University
July 31, 2016
JSM - Chicago, IL





Restricted Boltzmann machines

What is this?

A restricted Boltzmann machine (RBM) is a model with two layers of nodes - a hidden layer (\(\mathcal{H}\)) and a visible layer (\(\mathcal{V}\)) (Smolensky 1986).



RBMs are commonly used for image classification: each image pixel is a node in the visible layer, and the fitted hidden layer yields features that are passed to a supervised learner.

Joint Distribution

Let \(x = \{h_1, ..., h_H, v_1, ..., v_V\}\) represent the states of the hidden and visible nodes in an RBM. Then the probability of the nodes jointly taking the values specified by \(x\) is:

\[ f_{\theta} (x) = \frac{\exp\left(Q(x)\right)}{\sum\limits_{\tilde{x} \in \mathcal{S}}\exp\left(Q(\tilde{x})\right)} \]

where \(Q(x) = \sum\limits_{i = 1}^V \sum\limits_{j=1}^H \theta_{ij} v_i h_j + \sum\limits_{i = 1}^V\theta_{v_i} v_i + \sum\limits_{j = 1}^H\theta_{h_j} h_j\) denotes the negpotential function of the model, which has support set \(\mathcal{S}\).
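
As an illustrative sketch (not from the slides), the joint distribution of a small \(\{0,1\}\)-coded RBM can be computed exactly by enumerating its support; the parameter values below are hypothetical.

```python
import itertools

import numpy as np

def negpotential(v, h, W, b_v, b_h):
    """Q(x) = sum_ij theta_ij v_i h_j + sum_i theta_{v_i} v_i + sum_j theta_{h_j} h_j."""
    return v @ W @ h + b_v @ v + b_h @ h

def rbm_pmf(W, b_v, b_h):
    """Exact joint pmf of a small binary {0,1} RBM by enumerating its support."""
    V, H = W.shape
    support = [np.array(x, dtype=float)
               for x in itertools.product([0, 1], repeat=V + H)]
    q = np.array([negpotential(x[:V], x[V:], W, b_v, b_h) for x in support])
    p = np.exp(q - q.max())              # subtract the max for numerical stability
    return support, p / p.sum()

# Hypothetical parameter values for a toy model with V = 2 visible and H = 2 hidden nodes
W = np.array([[1.0, -0.5], [0.25, 2.0]])
support, probs = rbm_pmf(W, b_v=np.zeros(2), b_h=np.zeros(2))
print(probs.max())   # largest single-configuration probability; values near 1 signal near-degeneracy
```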

Deep learning

A “deep Boltzmann machine” stacks multiple single-layer restricted Boltzmann machines, with the hidden layer of each lower RBM acting as the visible layer of the model stacked above it.



Deep Boltzmann machines are claimed to learn “internal representations that become increasingly complex” (Salakhutdinov and Hinton 2009) and are used in classification problems.
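
A minimal sketch of the stacking idea, assuming \(\{0,1\}\)-coded nodes (so each layer has logistic full conditionals given the layer below) and hypothetical, untrained parameters; `sample_hidden` is an illustrative helper, not the authors' fitting procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_hidden(layer_below, W, b):
    """Sample a binary {0,1} layer given the layer below (logistic full conditionals)."""
    p = 1.0 / (1.0 + np.exp(-(layer_below @ W + b)))
    return (rng.random(p.shape) < p).astype(float)

# Hypothetical parameters for two stacked RBMs: V = 4 visible, H1 = 3 and H2 = 2 hidden nodes
V, H1, H2 = 4, 3, 2
W1, b1 = rng.normal(size=(V, H1)), np.zeros(H1)
W2, b2 = rng.normal(size=(H1, H2)), np.zeros(H2)

v = rng.integers(0, 2, size=V).astype(float)  # a visible configuration (e.g., image pixels)
h1 = sample_hidden(v, W1, b1)                 # hidden layer of the lower RBM ...
h2 = sample_hidden(h1, W2, b2)                # ... serves as the "visible" layer of the RBM above
print(h1, h2)
```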

Why do I care?

Current heuristic fitting methods seem to work for classification. Beyond classification, RBMs are generative models:

To generate data from an RBM, we can start with a random state in one of the layers and then perform alternating Gibbs sampling. (Hinton, Osindero, and Teh 2006)
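
For a \(\{0,1\}\)-coded RBM, conditional independence within a layer makes the full conditionals logistic, so the alternating Gibbs sampler is simple to write down. Below is a minimal sketch with hypothetical parameter values.

```python
import numpy as np

rng = np.random.default_rng(2016)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gibbs_sample(W, b_v, b_h, n_steps=1000):
    """Alternating Gibbs sampling for a binary {0,1} RBM.

    W   : (V, H) interaction parameters theta_ij
    b_v : (V,) visible-node parameters theta_{v_i}
    b_h : (H,) hidden-node parameters theta_{h_j}
    """
    V, H = W.shape
    h = rng.integers(0, 2, size=H).astype(float)        # random starting state in one layer
    for _ in range(n_steps):
        v = (rng.random(V) < sigmoid(W @ h + b_v)).astype(float)   # visible | hidden
        h = (rng.random(H) < sigmoid(v @ W + b_h)).astype(float)   # hidden  | visible
    return v, h

# Hypothetical parameter values for a tiny RBM with V = 3, H = 2
W = np.array([[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]])
v, h = gibbs_sample(W, b_v=np.zeros(3), b_h=np.zeros(2))
print(v, h)
```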

Can we fit a model that generates data that looks like real data?

Degeneracy, instability, and uninterpretability. Oh my!

Near-degeneracy

The highly flexible nature of the RBM (\(H + V + HV\) parameters) makes three characteristics of model impropriety particularly concerning.

Characteristic: A disproportionate amount of probability is placed on only a few elements of the sample space by the model (Handcock et al. 2003).

Detection: If the random variables in \(Q(\cdot)\) have a collective mean \(\mu(\theta)\) close to the boundary of the convex hull of \(\mathcal{S}\).
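
A sketch of this check for the toy \(V = H = 1\) model used in the coding example below, under the \(\{0,1\}\) coding; the parameter values and helper names are hypothetical, and scipy's ConvexHull supplies the hull geometry.

```python
import itertools

import numpy as np
from scipy.spatial import ConvexHull

def mean_and_hull(theta_vh, theta_v, theta_h):
    """Toy RBM with V = H = 1: statistic vector t(x) = (vh, v, h) over x = (v, h) in {0,1}^2."""
    support = list(itertools.product([0.0, 1.0], repeat=2))
    stats = np.array([[v * h, v, h] for v, h in support])
    q = np.array([theta_vh * v * h + theta_v * v + theta_h * h for v, h in support])
    p = np.exp(q - q.max())
    p /= p.sum()
    return p @ stats, ConvexHull(stats)      # mean parameter mu(theta) and the hull of S

def dist_to_boundary(theta_vh, theta_v=0.0, theta_h=0.0):
    mu, hull = mean_and_hull(theta_vh, theta_v, theta_h)
    # hull.equations rows are (unit normal, offset) with normal . x + offset <= 0 inside,
    # so -(normal . mu + offset) is the distance from mu to each facet
    return float(np.min(-(hull.equations[:, :-1] @ mu + hull.equations[:, -1])))

print(dist_to_boundary(0.1))    # moderate interaction: mu(theta) stays away from the boundary
print(dist_to_boundary(20.0))   # large interaction: mu(theta) is pushed essentially onto the boundary
```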

Instability

Let \(R(\theta) = \max_{v} \max_{h}Q(x) - \min_{v}\max_{h}Q(x) - H\log 2\).

Characteristic: Small changes in the natural parameters result in large changes in probability masses; excessive sensitivity (Schweinberger 2011).

Detection: If \(R(\theta)/V\) is large, then the maximum log-likelihood ratio of two images that differ in only one pixel is large.
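
A brute-force sketch of this diagnostic for a small model (\(\{0,1\}\) coding assumed, hypothetical parameter values):

```python
import itertools

import numpy as np

def Q(v, h, W, b_v, b_h):
    """Negpotential: sum_ij theta_ij v_i h_j + sum_i theta_{v_i} v_i + sum_j theta_{h_j} h_j."""
    return v @ W @ h + b_v @ v + b_h @ h

def instability_ratio(W, b_v, b_h):
    """R(theta)/V with R(theta) = max_v max_h Q - min_v max_h Q - H log 2, by enumeration."""
    V, H = W.shape
    vs = [np.array(v, dtype=float) for v in itertools.product([0, 1], repeat=V)]
    hs = [np.array(h, dtype=float) for h in itertools.product([0, 1], repeat=H)]
    max_over_h = np.array([max(Q(v, h, W, b_v, b_h) for h in hs) for v in vs])
    R = max_over_h.max() - max_over_h.min() - H * np.log(2)
    return R / V

# Hypothetical parameters: the same structure with small vs. large interaction terms
V, H = 3, 2
rng = np.random.default_rng(1)
base = rng.normal(size=(V, H))
print(instability_ratio(0.1 * base, np.zeros(V), np.zeros(H)))    # small weights: small ratio
print(instability_ratio(10.0 * base, np.zeros(V), np.zeros(H)))   # large weights: large ratio flags instability
```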

Uninterpretability

Characteristic: Due to the presence of dependence, the marginal mean-structure is no longer maintained (Kaiser 2007).

Detection: If the magnitude of the difference between model expectations and expectations under independence (dependence parameters set to zero), \(\left\vert E( X \vert \theta) - E( X \vert \emptyset ) \right\vert\), is large.
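
A sketch of this comparison by full enumeration for a small \(\{0,1\}\)-coded RBM; the parameter values are hypothetical, and the independence expectation is obtained here by zeroing the interaction terms \(\theta_{ij}\) while keeping the main-effect terms.

```python
import itertools

import numpy as np

def node_means(W, b_v, b_h):
    """E(X | theta): expected value of every node (visible then hidden), by full enumeration."""
    V, H = W.shape
    support = np.array(list(itertools.product([0.0, 1.0], repeat=V + H)))
    q = np.array([x[:V] @ W @ x[V:] + b_v @ x[:V] + b_h @ x[V:] for x in support])
    p = np.exp(q - q.max())
    p /= p.sum()
    return p @ support

# Hypothetical parameters for V = 2 visible and H = 2 hidden nodes
W = np.array([[2.0, -1.0], [0.5, 3.0]])
b_v, b_h = np.array([0.2, -0.4]), np.array([0.1, 0.0])

mu_model = node_means(W, b_v, b_h)                   # expectations under the dependence model
mu_indep = node_means(np.zeros_like(W), b_v, b_h)    # dependence (interaction) parameters set to zero
print(np.abs(mu_model - mu_indep))                   # large entries flag uninterpretability
```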

Manageable (a.k.a. small) examples

RBMs are easily near-degenerate, unstable, and uninterpretable over large portions of the parameter space.

Data coding to mitigate degeneracy

Convex hulls of the statistic space for a toy RBM with \(V = H = 1\), under both \(\{0,1\}\)- and \(\{-1,1\}\)-encoding, enclosed by an unrestricted hull in 3-space.
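
For concreteness (a worked enumeration under the stated codings, not taken from the slides), with \(V = H = 1\) the statistic vector is \((vh, v, h)\), so the four support points are:

\[ \{0,1\}\text{-encoding: } (0,0,0),\ (0,0,1),\ (0,1,0),\ (1,1,1); \qquad \{-1,1\}\text{-encoding: } (1,-1,-1),\ (-1,-1,1),\ (-1,1,-1),\ (1,1,1). \]

The \(\{-1,1\}\) points are the vertices of a regular tetrahedron centered at the origin, while the \(\{0,1\}\) points form an irregular, uncentered tetrahedron - one way to see why the choice of coding changes where in parameter space the model mean approaches the hull boundary.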

Data coding to mitigate degeneracy (cont’d)

Bayesian model fitting

A tale of three methods

Posterior distributions of images

Wrapping up

Thank you!

References

Handcock, Mark S, Garry Robins, Tom AB Snijders, Jim Moody, and Julian Besag. 2003. “Assessing Degeneracy in Statistical Models of Social Networks.” Working paper.

Hinton, Geoffrey E, Simon Osindero, and Yee-Whye Teh. 2006. “A Fast Learning Algorithm for Deep Belief Nets.” Neural Computation 18 (7). MIT Press: 1527–54.

Kaiser, Mark S. 2007. “Statistical Dependence in Markov Random Field Models.” Statistics Preprints Paper 57. Digital Repository @ Iowa State University. http://lib.dr.iastate.edu/stat_las_preprints/57/.

Li, Jing. 2014. “Biclustering Methods and a Bayesian Approach to Fitting Boltzmann Machines in Statistical Learning.” PhD thesis, Iowa State University. http://lib.dr.iastate.edu/etd/14173/.

Salakhutdinov, Ruslan, and Geoffrey E Hinton. 2009. “Deep Boltzmann Machines.” In International Conference on Artificial Intelligence and Statistics, 448–55.

Schweinberger, Michael. 2011. “Instability, Sensitivity, and Degeneracy of Discrete Exponential Families.” Journal of the American Statistical Association 106 (496). Taylor & Francis: 1361–70.

Smolensky, P. 1986. “Information Processing in Dynamical Systems: Foundations of Harmony Theory.” In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, edited by David E. Rumelhart, James L. McClelland, and the PDP Research Group, 194–281. Cambridge, MA, USA: MIT Press. http://dl.acm.org/citation.cfm?id=104279.104290.

Zhou, Wen. 2014. “Some Bayesian and Multivariate Analysis Methods in Statistical Machine Learning and Applications.” PhD thesis, Iowa State University. http://lib.dr.iastate.edu/etd/13816/.