TensorFlow Reproductions: Big Deep Simple MNIST

I am starting a new project to try to reproduce, in TensorFlow, some core deep learning papers from some of the big names.

The motivation: to understand how to build very deep networks and why they do (or don’t) work.

There are several papers that caught my eye, starting with the 2010 “big, deep, simple” MNIST paper discussed below.

These papers set the foundation for looking at much larger, deeper networks such as FractalNets.

FractalNets are particularly interesting since they suggest that very deep networks do not need student-teacher learning and can instead be self-similar (which is related to very recent work on the Statistical Physics of Deep Learning and the Renormalization Group analogy).

IMHO, it is not enough just to implement the code; the results have to be excellent as well. I am not impressed with the results I have seen so far, and I would like to tease out what is really going on.

Big Deep Simple Nets

The 2010 paper (Cireșan et al., “Deep, Big, Simple Neural Nets for Handwritten Digit Recognition”) still appears to be one of the top 10 results on MNIST:

http://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results.html

The idea is simple: they claim to get state-of-the-art accuracy on MNIST using a plain 5-layer MLP by running a large number of epochs with just SGD, a decaying learning rate, and an augmented data set.

The key idea is that the augmented data set can provide, in practice, an infinite amount of training data. And having infinite data means that we never have to worry about overtraining, no matter how many adjustable parameters we have, and therefore any reasonably sized network will do the trick if we just run it long enough.
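To make the recipe concrete, here is a minimal tf.keras sketch of the setup (not the code from my notebooks): a plain MLP with five hidden layers, trained with SGD and an exponentially decaying learning rate. The layer widths, decay constants, and tanh activation are placeholder choices on my part, not the exact values from the 2010 paper.

```python
import tensorflow as tf

# A minimal sketch of the "big, deep, simple" recipe: a plain MLP trained
# with SGD and an exponentially decaying learning rate.
def build_mlp(hidden=(2500, 2000, 1500, 1000, 500)):
    model = tf.keras.Sequential([tf.keras.layers.Flatten(input_shape=(28, 28))])
    for width in hidden:
        model.add(tf.keras.layers.Dense(width, activation="tanh"))
    model.add(tf.keras.layers.Dense(10, activation="softmax"))
    return model

lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3, decay_steps=10_000, decay_rate=0.99)

model = build_mlp()
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=lr_schedule),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```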

In other words, there is no convolution gap (a plain MLP can match convolutional nets), no need for early stopping, and really no regularization at all.

This sounds dubious to me, but I wanted to see for myself. Also, perhaps I am missing some subtle detail. Did they clip gradients somewhere? Is the activation function central? Do we need to tune the learning rate decay?
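For the runs, these questions mostly come down to a couple of knobs on the same setup. A hedged example, with values that are guesses rather than anything taken from the paper:

```python
import tensorflow as tf

# Per-gradient norm clipping plus an adjustable decay schedule on SGD;
# the clipping threshold and decay constants here are guesses.
clipped_sgd = tf.keras.optimizers.SGD(
    learning_rate=tf.keras.optimizers.schedules.ExponentialDecay(
        initial_learning_rate=1e-3, decay_steps=5_000, decay_rate=0.97),
    clipnorm=1.0)  # clips each gradient tensor's norm; drop it for plain SGD

# Swapping the activation is a one-line change in the MLP sketch above
# (e.g. activation="relu" instead of "tanh" in the Dense layers).
```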

I have initial notebooks on github, and would welcome feedback and contributions, plus ideas for other papers to reproduce.

I am trying to repeat this experiment using TensorFlow and two kinds of augmented data sets:

  • InfiMNIST (2006) – provides nearly 1B deformations of MNIST
  • AlignMNIST (2016) – provides 75-150 epochs of deformed MNIST

(and let me say a special personal thanks to Søren Hauberg for providing this recent data set)
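Since the full deformation sets are far too large to hold in memory, I am assuming a streaming loader. Below is a rough tf.data sketch (not my actual loader): the generator just re-yields randomly shifted copies of plain MNIST as a stand-in for reading the InfiMNIST or AlignMNIST files.

```python
import numpy as np
import tensorflow as tf

def deformed_examples():
    """Stand-in generator for a pre-deformed data set (InfiMNIST/AlignMNIST).
    Here it re-yields randomly shifted copies of the original MNIST training
    set; a real loader would read the deformation files instead."""
    (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
    while True:
        for x, y in zip(x_train, y_train):
            shift = (np.random.randint(-2, 3), np.random.randint(-2, 3))
            yield np.roll(x, shift, axis=(0, 1)).astype("float32") / 255.0, int(y)

dataset = (tf.data.Dataset
           .from_generator(
               deformed_examples,
               output_signature=(tf.TensorSpec((28, 28), tf.float32),
                                 tf.TensorSpec((), tf.int64)))
           .shuffle(10_000)
           .batch(128)
           .prefetch(tf.data.AUTOTUNE))

# model.fit(dataset, steps_per_epoch=469, epochs=100) streams "infinite" data
```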

I would like to try other methods, such as the Keras Data Augmentation library (see below), or even the recent data generation library coming out of OpenAI.

Current results are up on github for both augmented data sets.

The initial results indicate that AlignMNIST is much better than InfiMNIST for this simple MLP, although I still do not see the extremely high, top-10 accuracy reported.

Furthermore, the 5-layer InfiMNIST run actually diverges after ~100 epochs. So we still need early stopping, even with an effectively infinite amount of data.
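So for now I am keeping an early-stopping guard in the long runs; in tf.keras this is just a callback (the patience value below is a placeholder, not a tuned number):

```python
import tensorflow as tf

# Early-stopping guard for the long runs; patience is a placeholder value.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_accuracy", patience=10, restore_best_weights=True)

# model.fit(train_dataset, validation_data=val_dataset, epochs=500,
#           callbacks=[early_stop])
```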

It may be interesting to try the Keras ImageDataGenerator class, described in the related blog post “Building powerful image classification models using very little data”.
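As a rough sketch of what that would look like on MNIST (the augmentation ranges below are guesses, not a tuned setup):

```python
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 28, 28, 1).astype("float32") / 255.0

# Small affine jitter of the digits; the ranges are illustrative guesses.
datagen = ImageDataGenerator(rotation_range=10,
                             width_shift_range=0.1,
                             height_shift_range=0.1,
                             shear_range=0.1,
                             zoom_range=0.1)

augmented_batches = datagen.flow(x_train, y_train, batch_size=128)
# model.fit(augmented_batches, epochs=100) trains on freshly deformed digits
```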

Also note that the OpenAI group has released a new paper and code for generating data with generative adversarial networks (GANs).

I will periodically update this blog as new data comes in, and I have the time to implement these newer techniques.

Next, we will check in the log files and discuss the TensorBoard results.
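For reference, the logging I am assuming is just the standard tf.keras TensorBoard callback writing to a per-run directory (the path below is a placeholder):

```python
import datetime
import tensorflow as tf

# Write TensorBoard summaries to a per-run log directory (placeholder path).
log_dir = "logs/mlp-alignmnist-" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_cb = tf.keras.callbacks.TensorBoard(log_dir=log_dir)

# model.fit(..., callbacks=[tensorboard_cb]); then run: tensorboard --logdir logs
```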

Comments, criticisms, and contributions are very welcome.

(chat on gitter)

 
