# Project Goals

• Explain in detail a model of a Deep Feedforward Network, or Multilayer Perceptron (MLP), that is:
• Modular and expandable
• Configurable to use any specified depth with any number of units per layer
• Provides a selection of activation functions for each hidden layer
• Multiple options for learning rate scheduling
• Implement and thoroughly document the model using only base Python and Numpy with careful consideration to numerical stability
• Demonstrate reproducibly running the model on the MNIST, CIFAR-10, and CIFAR-100 benchmark datasets
• Discuss the results

# Model Overview

## Description of Variables

• $$\bf{X}$$: The input matrix (rows are observations).
• $$\bf{y}$$: (True) classification vector of the rows of $$\bf{X}$$.
• $$\bf{Y}$$: A "wide" matrix version of $$\bf{y}$$; entries are 1 where the row observation belongs to the column class. Zero everywhere else.
• $$\bf{W}^{(k)}$$: The weights matrix of layer $$k$$
• $$\bf{b}^{(k)}$$: The bias vector of layer $$k$$
• $$\bf{Z}^{(k)}$$: The linear, pre-activation outputs of layer $$k$$
• $$\bf{\phi}^{(k)}$$: The activation function of layer $$k$$, applied elementwise to $$\bf{Z}^k$$
• $$\bf{H}^{(k)}$$: The activations of layer $$k$$
• $$\ell$$: The total number of layers (depth of the model)
• $$\bf{P}$$: The activations of the output layer ($$\bf{P}=\bf{H}^{(k)}$$)

Hyperparameters:

• $$\lambda$$: The regularization constant
• $$\eta$$: The learning rate
• Hyperparameters for learning schedules are described individually in their section.

# Model Components with Implementations

The first consideration of the implementation will be to check that the dimensions of $$\bf{X}$$, $$\bf{y}$$, $$\mathbb{W}=\{\bf{W}^{(l)}|1\leq i \leq \ell\}$$, and $$\mathbb{B}=\{\bf{b}^{(l)}|1\leq i \leq \ell\}$$.

import numpy as np

def mlp_check_dimensions(x, y, ws, bs):
"""
Return True if the dimensions in double_u and beta agree.
:param x: a list of lists representing the x matrix.
:param y: a list output values.
:param ws: a list of weight matrices (one for each layer)
:param bs: a list of biases (one for each layer)
:return: True if the dimensions of x, y, ws and bs match
"""
x = np.array(x)
y = np.array(y)
if x.shape[0] == y.shape[0]:
if len(ws) == len(bs):
for i, w in enumerate(ws):
# check weight and bias vectors
if ws[i].shape[1] == ws[i+1].shape[0]:
if ws[i].shape[1] == bs[i].shape[1]:
if ws[i+1].shape[1] == bs[i+1].shape[1]:
if x.shape[1] == ws[i].shape[0]:
return True
if ws[-1].shape[1] == bs[-1].shape[1] == y.shape[1]:
return True
return False

This subroutine will be the first called by the final run model routine. The run routine will only contunie provided the dimensions match.

## Forward Propagation Subroutines

Two activation functions are implemented for the hidden layers: hyperbolic tangent and the Rectified Linear Unit function (ReLU). This list can be easily added to by implementing other functions, though any additional functions will require additions to the gradient calculations in mlp_propagate_error below.

def mlp_ReLU(z):
"""
Return the Rectified Linear Unit
:param z: the input "affinities" for this neuron.
:return: hyperbolic tangent "squashing function"
"""
return np.maximum(np.zeros(np.shape(z)), z)

Hyperbolic tangent is prone to saturation, and its use is generally discouraged now [Goodfellow, et al.], but is simple enough to implement that it is included.

def mlp_tanh(z):
"""
Return the hyperbolic tangent of z using numpy.
:param z: the input "affinities" for this neuron.
:return: hyperbolic tangent "squashing function"
"""
return np.tanh(z)

Since this is a classification model, and it pairs well with the cost function used, the output layer is bound to be the softmax function $$\phi_\ell(z)=\text{softmax}(z)$$. The saturation of the softmax function is accounted for by the negative log-likelihood cost function. Here, careful consideration will need to be given to numerical stability. Specifically, with softmax, we run the risk of underflow when taking the exponent. Since softmax is invariant to the addition of a constant vector, we transform each row to have a maximum value of zero, i.e.:

$Z_{i,j} \leftarrow Z_{i,j}-\displaystyle\max_{k} Z_{i,k}$

def mlp_softmax(z):
"""
Return the softmax function at z using a numerically stable approach.
:param z: A real number, list of real numbers, or list of lists of numbers.
:return: The output of the softmax function for z with the same shape as z.
"""
for i, z_i in enumerate(z):
row_max = max(z_i)
for j, z_j in enumerate(z_i):
z[i][j] = z[i][j] - row_max
h = z.copy()
for i, z_i in enumerate(z):
row_sum = 0
for j, z_j in enumerate(z_i):
row_sum += np.exp(z[i][j])
for j, z_j in enumerate(z_i):
h[i][j] = np.exp(z[i][j]) / row_sum
return h

Next we need to detail forward propagation through a layer.

The first subroutine, mlp_linear_input, recieves layer input and returns that input as a linear function of that input: $$\bf{H}^{k-1}$$, h; $$\bf{W}^k$$, w; and biases $$\bf{b}^k$$, b. The output z=$$\bf{Z}^k$$ is defined as:

$\bf{Z}^k=\bf{H}^{k-1}\bf{W}^k+\bf{b}^k$

def mlp_linear_input(h, w, b):
"""
Return the layer input z as a function of h, w, and b.
:param h: the input from the previous layer.
:param w: the weights for this layer.
:param b: the biases for this layer.
:return: the linear network activations for this layer.
"""
z = h.dot(w) + b
return z

Next, we will apply a specified activation function $$\phi^{(k)}$$, phi to $$\bf{Z}^k$$ and return $$\bf{H}^k$$.

$\bf{H}^k=\phi^{(k)}(\bf{Z}^k)$

def mlp_feed_layer(h, w, b, phi):
"""
Return the output of a layer of the network.
:param h: The input to the layer (output of last layer).
:param w: The weight matrix for this layer.
:param b: The bias vector for this layer.
:param phi: The activation function for this layer.
:return: The output of this layer.
"""
h = phi(mlp_linear_input(h, w, b))
return h

We will need to keep the outputs for each layer to compute gradients later on.

This next function will therefore take as input our network input $$\bf{X}$$ and lists of activation functions $$\varphi=\left[ \phi^{(1)}, \phi^{(2)}, ... \phi^{(\ell)} \right]$$ and network parameters for each layer: $$\mathbb{W}=\left[ \bf{W}^{(1)}, \bf{W}^{(2)}, ... \bf{W}^{(\ell)} \right]$$, $$\mathbb{B}=\left[ \bf{b}^{(1)}, \bf{b}^{(2)}, ... \bf{b}^{(\ell)} \right]$$.

The output will be a list of outputs from each layer:

$\mathbb{H}=\left[ \bf{H}^{(1)}, \bf{H}^{(2)}, ... \bf{H}^{(\ell)} \right]$

def mlp_feed_forward(x, ws, bs, phis):
"""
Return the output of each layer of the network.
:param x: The input matrix to the network.
:param ws: The list of weight matrices for layers 1 to l.
:param bs: The list of bias vectors for layers 1 to l.
:param phis: The list of activation functions for layers 1 to l.
:return: The list of outputs for layers 0 to l
"""
hs = []
hs.append(x)
for i, phi_i in enumerate(phis):
hs.append(mlp_feed_layer(hs[i], ws[i], bs[i], phis[i]))
return hs

## Prediction Subroutines

The function below, mlp_predict_probs, will call mlp_feed_forward on the input matrix $$\bf{X}$$, list of network parameters ($$\mathbb{W}$$ and $$\mathbb{B}$$), and list of activation functions $$\varphi$$ and output the matrix of class probabilities $$\bf{P}$$.

$$\bf{P}$$ is comprised of row vectors; the $$n^{\text{th}}$$ entry of $$\bf{P}$$'s $$i^{\text{th}}$$ row is the predicted probability that the $$i^{\text{th}}$$ data point (in the $$i^{\text{th}}$$ row of $$\bf{X}$$) belongs to class $$n$$.

def mlp_predict_probs(x, ws, bs, phis):
"""
Return the output matrix of probabilities for input matrix 'x'.
:param x: The input matrix to the network.
:param ws: The list of weight matrices for layers 1 to l.
:param bs: The list of bias vectors for layers 1 to l.
:param phis: The list of activation functions for layers 1 to l.
:return: The output matrix of probabilities (p)
"""
hs = mlp_feed_forward(x, ws, bs, phis)
return hs[-1]

From $$\bf{P}$$, the function below, mlp_predict, will predict the class labels for $$\bf{X}$$ by observing the maximum probability in each row vector.

def mlp_predict(x, ws, bs, phis):
"""
Return the output vector of labels for input matrix 'x'.
:param x: The input matrix to the network.
:param ws: The list of weight matrices for layers 1 to l.
:param bs: The list of bias vectors for layers 1 to l.
:param phis: The list of activation functions for layers 1 to l.
:return: The output vector of class labels.
"""
p = mlp_predict_probs(x, ws, bs, phis)
y_hat = np.argmax(p, axis = 1)
return y_hat

## Training Subroutines

Cost Function. $$C$$ is the negative log-likelihood cross entropy cost function of the list of weights $$\mathbb{W}$$ and the list of biases $$\mathbb{B}$$, and it includes an $$\ell^2$$-norm regularization term, requiring a regularization hyperparameter $$\lambda$$. (Let $$m$$ be the number of rows/data points in $$\bf{X}$$ and $$c$$ the number of classifications). $$C$$ is given by

$C(\mathbb{W}, \mathbb{B})=-\frac{1}{m}\displaystyle\sum_{i=1}^m\displaystyle\sum_{k=1}^c \left( \bf{Y} \odot \text{ln} (\bf{P} + \epsilon) \right)^{(i)}_{k} +\frac{\lambda}{2}\displaystyle\sum_{k=1}^\ell \Vert \bf{W}^{(k)} \Vert^2$

where $$\bf{Y}$$ is defined such that:

$y^{(i)}_k=\begin{cases} 1 \quad &\text{if} \, y^{(i)}=k \\ 0 \quad &\text{otherwise} \\ \end{cases}$

We add a small $$\epsilon$$ within the logarithm to ensure we never take the logarithm of zero. For this implementation, $$\epsilon = 1 \times 10^{-8}$$.

The choice of negative log-likelihood is natural, as it will help invert the saturation of the output layer's multinoulli distributions. It is implemented by mlp_cost below.

def mlp_cost(x, y, ws, bs, phis, alpha):
"""
Return the cross entropy cost function with L2 regularization term.
:param x: a list of lists representing the x matrix.
:param y: a list of lists of output values.
:param ws: a list of weight matrices (one for each layer)
:param bs: a list of biases (one for each layer)
:param phis: a list of activation functions
:param alpha: the hyperparameter controlling regularization
:return: The cost function
"""

#Cross Entropy Cost
p = np.multiply(y, np.log(mlp_predict_probs(x, ws, bs, phis) + 1e-8))

crossEntCost = - np.sum(p) / len(y)

# Regularization Term
regTerm = 0
for i in range(len(ws)):
for j in range(len(ws[i])):
for k in range(len(ws[i][j])):
regTerm += ws[i][j][k] ** 2
regTerm *= alpha / 2

return crossEntCost + regTerm

Backpropagation. To compute the gradient of the cost function using backpropagation, we need the gradient of $$C$$ with respect to each layer's linear activation. We will call the propagated errors $$\bf{D}^{(k)}$$, the gradient of $$C$$ with respect to $$\bf{Z}^{(k)}$$ for layer $$k$$. The model uses either hyperbolic tangent or ReLU for hidden layers, and the output layer uses softmax.

Therefore, the gradient we will use for the output layer is:

$\bf{D}^{(\ell)}=\nabla _{\bf{Z}^{\ell}}C(\mathbb{W}, \mathbb{B})=\frac{1}{m}(\bf{P}-\bf{Y})$

Layers with hyperbolic tangent activation functions $$k$$ have the following gradient:

$\bf{D}^{(k)}=\nabla _{\bf{Z}^{k}}C(\mathbb{W}, \mathbb{B})=\bf{D}^{(k+1)}\left( \bf{W}^{(k+1)} \right)^\top \odot \left( 1- (\bf{H}^{(k)})^2 \right)$

Layers with ReLU activation functions $$k$$ have the following gradient:

$\bf{D}^{(k)}=\nabla _{\bf{Z}^{k}}C(\mathbb{W}, \mathbb{B})=\bf{D}^{(k+1)}\left( \bf{W}^{(k+1)} \right)^\top \odot \left( \text{bool}(\bf{H}^{(k)}) \right)$

$$\text{bool}$$ simply denotes that any non-zero entries of the matrix are a 1.

The following routine, mlp_propagate_error, takes the list of weights $$\mathbb{W}$$ and the list of biases $$\mathbb{B}$$, along with the activation functions of each layer and the list of outputs of each layer $$\mathbb{H}$$ we saved from mlp_feed_forward, and, utilizing the gradients above, it returns a list of propagated errors for each layer, $$\mathbb{D}$$:

$\mathbb{D}=\left[ \bf{D}^{(1)}, \bf{D}^{(2)}, ... \bf{D}^{(\ell)} \right]$

def mlp_propagate_error(x, y, ws, bs, phis, hs):
"""
Return a list containing the gradient of the cost with respect to z^(k)
for each layer.
:param x: a list of lists representing the x matrix.
:param y: a list of lists of output values.
:param ws: a list of weight matrices (one for each layer)
:param bs: a list of biases (one for each layer)
:param phis: a list of activation functions
:param hs: a list of outputs for each layer include h^(0) = x
:return: A list of gradients of C with respect to z^(k) for k=1..l
"""
bigD = []
for i in range(len(hs) - 1, 0, -1):
if (i == len(hs) - 1):
bigD.append((hs[-1] - y) / len(y))
else:
dKPlusOne = bigD[len(hs) - (i + 2)]
if (phis[i] == mlp_tanh):
#tanh
hSqrd = hs[i] ** 2

matMUL = np.matmul(dKPlusOne, np.transpose(ws[i]))

compMUL = np.multiply(matMUL, 1 - hSqrd)
else:
#ReLU
hSqrd = hs[i]
hSqrd[hSqrd > 0] = 1

matMUL = np.matmul(dKPlusOne, np.transpose(ws[i]))

compMUL = np.multiply(matMUL, hSqrd)

bigD.append(compMUL)
returnArray = []
for i in range(len(bigD) - 1, -1, -1):
returnArray.append(bigD[i])
return returnArray

Gradient. With the above, we can compute the gradient with respect to the network parameters $$\mathbb{W}$$ and $$\mathbb{B}$$. (Reminder: $$\lambda$$ is the regularization hyperparameter.)

$\nabla _{\bf{W}^{k}}C(\mathbb{W}, \mathbb{B})=\left( \bf{H}^{(k-1)} \right)^\top \bf{D}^{(k)} + \lambda \bf{W}^{(k)}$

$\nabla _{\bf{B}^{k}}C(\mathbb{W}, \mathbb{B})= \bf{1} ^\top \bf{D}^{(k)}$

Note that $$\bf{1}$$ is an $$m \times 1$$ matrix of ones.

Thus, the routine below, mlp_gradient returns lists of these gradients for $$\mathbb{W}$$ and $$\mathbb{B}$$ for each layer:

$\nabla _{\mathbb{W}}C(\mathbb{W}, \mathbb{B})=\left[ \nabla _{\bf{W}^{1}}C(\mathbb{W}, \mathbb{B}), \nabla _{\bf{W}^{2}}C(\mathbb{W}, \mathbb{B}), ... \nabla _{\bf{W}^{\ell}}C(\mathbb{W}, \mathbb{B}) \right]$

$\nabla _{\mathbb{B}}C(\mathbb{W}, \mathbb{B})=\left[ \nabla _{\bf{b}^{1}}C(\mathbb{W}, \mathbb{B}), \nabla _{\bf{b}^{2}}C(\mathbb{W}, \mathbb{B}), ... \nabla _{\bf{b}^{\ell}}C(\mathbb{W}, \mathbb{B}) \right]$

def mlp_gradient(x, y, ws, bs, phis, alpha):
"""
Return a list containing the gradient of the cost with respect to z^(k)
for each layer.
:param x: a list of lists representing the x matrix.
:param y: a list of lists of output values.
:param ws: a list of weight matrices (one for each layer)
:param bs: a list of biases (one for each layer)
:param phis: a list of activation functions
:param alpha: the regularization term
:return: A list of gradients of J with respect to z^(k) for k=1..l
"""
hss = mlp_feed_forward(x, ws, bs, phis)
ds = mlp_propagate_error(x, y, ws, bs, phis, hss)
gws = []
gbs = []
for k in range(0, len(hss) - 1):
gws.append(np.matmul(np.transpose(hss[k]), ds[k]) + alpha * ws[k])
gbs.append(np.matmul(np.transpose(np.ones(len(x))), ds[k]))
return [gws, gbs]

## Learning Rate Scheduling

Gradient descent requires a learning rate hyperparameter $$\eta$$. For configurability and exploration, a few different approaches to learning rate scheduling, or how this learning rate changes over iterations of training, will be implemented for this model, and the end user may choose which to use.

Constant. One approach is to simply allow this hyperparameter to remain constant, that is,

$\eta_n=\eta_0$

def mlp_constant_eta(iteration, eta_details):
"""
Provides a constant learning rate.
:param iteration: the current iteration of gradient descent
:param eta_details[0]: the desired constant learning rate
:return: a constant learning rate equal to the input
"""
return eta_details[0]

Step-Based. Alternatively, we could have learning rate decrease in steps. With this approach, starting at a learning rate $$\eta_0$$, every $$r$$ iterations, the learning rate drops by a proportion $$d$$:

$\eta_n=\eta_0d^{\left \lfloor \frac{1+n}{r} \right \rfloor}$

For example, $$r=15$$ and $$d=0.5$$ halves the learning rate every 15 iterations.

def mlp_step_based_eta(iteration, eta_details):
"""
Decreases the learning rate by proportion d every r iterations
of the training method.
:param iteration: the current iteration of gradient descent
:param eta_details[0]: the initial learning rate (\eta_0)
:param eta_details[1]: the droprate (r)
:param eta_details[2]: the decay proportion (d)
:return: the step-based learning rate for an iteration
"""
return eta_details[0] * np.power(eta_details[2], np.floor((1 + iteration) / eta_details[1]))

Exponential. The final approach implemented for the model will be a learning rate that decays exponentially. Starting with an initial value $$\eta_0$$, the learning rate will decay with a decay hyperparameter of $$d$$:

$\eta_n=\eta_o\text{e}^{-dn}$

def mlp_exponential_eta(iteration, eta_details):
"""
Decreases the learning rate by proportion d every r iterations
of the training method.
:param iteration: the current iteration of gradient descent
:param eta_details[0]: the initial learning rate (\eta_0)
:param eta_details[1]: the decay rate (d)
:return: the step-based learning rate for an iteration
"""
return eta_details[0] * np.exp(-eta_details[1] * iteration)

Numerically, we need not concern ourselves with the possiblity of underflow here, as a learning rate that small is simply impractical.

## GD and SGD

Gradient descent. Finally, with the gradient calculated and scheduling for our learning rate handled, we can perform gradient descent to reduce the cost function.

We do this by mocing the network parameters in the negative direction of the gradient scaled by the learning rate. For iteration $$i$$:

$\bf{W}^{(k)}_i = \bf{W}^{(k)}_{i-1} - \eta_i \cdot \nabla _{\bf{W}^{k}_{i-1}}C(\mathbb{W}, \mathbb{B})$

$\bf{B}^{(k)}_i = \bf{B}^{(k)}_{i-1} - \eta_i \cdot \nabla _{\bf{B}^{k}_{i-1}}C(\mathbb{W}, \mathbb{B})$

def mlp_gradient_descent(x, y, ws0, bs0, phis, alpha, n_iter, eta_func, eta_details):
"""
Uses gradient descent to estimate the weights, ws, and biases, bs,
that reduce the cost.
:param x: a list of lists representing the x matrix.
:param y: a list of lists of output values.
:param ws0: a list of initial weight matrices (one for each layer)
:param bs0: a list of initial biases (one for each layer)
:param phis: a list of activation functions
:param alpha: the hyperparameter controlling regularization
:param n_iter: the number of iterations
:param eta_func: the learning rate schedule
:param eta_details: hyperparameters for the learning rate schedule
:return: the estimate weights, the estimated biases, record of cost
"""
W = ws0.copy()
B = bs0.copy()

for i in range(n_iter):
gws, gbs = mlp_gradient(x, y, W, B, phis, alpha)
for k in range(len(W)):
W[k] -= eta_func(k, eta_details) * gws[k]
B[k] -= eta_func(k, eta_details) * gbs[k]
return [W, B]

Stochastic Gradient descent. To save time, we can also use Stochastic Gradient descent, or SGD. Instead of calculating the gradient for every datum in the training data, we instead can use "minibatches," or random samples from the training data of size $$t$$. For results to be reproducible, it is necessary to set the seed of numpy's PRNG before utilizing SGD.

def mlp_SGD(x, y, t, ws0, bs0, phis, alpha, n_iter, eta_func, eta_details):
"""
Uses stochastic gradient descent to estimate the weights, ws, and biases, bs,
that reduce the cost.
:param x: a list of lists representing the x matrix.
:param y: a list of lists of output values.
:param t: sample size of the minibatches.
:param ws0: a list of initial weight matrices (one for each layer)
:param bs0: a list of initial biases (one for each layer)
:param phis: a list of activation functions
:param alpha: the hyperparameter controlling regularization
:param n_iter: the number of iterations
:param eta_func: the learning rate schedule
:param eta_details: hyperparameters for the learning rate schedule
:return: the estimate weights, the estimated biases, record of cost
"""
W = ws0.copy()
B = bs0.copy()

# Set Seed for reproducibility
np.random.seed(1)

for i in range(n_iter):
# Random shuffle
perm = np.random.permutation(x.shape[0])

# Minibatch sample
x_minibatch = (x[perm])[0:t]
y_minibatch = (y[perm])[0:t]

gws, gbs = mlp_gradient(x_minibatch, y_minibatch, W, B, phis, alpha)
for k in range(len(W)):
W[k] -= eta_func(k, eta_details) * gws[k]
B[k] -= eta_func(k, eta_details) * gbs[k]
return [W, B]

# Benchmark Data Preparation

Before we see how to initialize and run the model, we must prepare our data for use.

Each mlp_*_data_prep will return the following for the given dataset:

• x_train - The training data matrix ($$\bf{X}$$)
• x_test - The test data matrix
• y_train - The training labels ($$\bf{y}$$)
• y_test - The test labels
• y_matrix_train - The training labels matrix ($$\bf{Y}$$) used in the cost function
• y_matrix_test - The test labels matrix used in the cost function

## MNIST

The MNIST dataset is nearly ubiquitous. The dataset contains 28x28 greatscale images of handwritten digits, 0-9. The labels are of the digits; thus, there are ten categories, one for each digit. The training data contains 60,000 observations, and the testing data contains 10,000 observations.

Here, we are starting with .npy versions of the training and test data. Below, we extract them and normalize the training $$\bf{X}$$.

def mlp_MNIST_data_prep():
"""
Return the MNIST dataset.
:return: The x matrix, y vector, and the wide matrix version of y
"""

# create the Y training wide matrix
m = x_train.shape[0]
c = (max(y_train) + 1) ## number of categories
y_matrix_train = np.zeros((m,c))
y_matrix_train[(range(m), y_train.astype(int))] = 1

# extract X and Y for testing

# Create Y test wise matrix
m = x_test.shape[0]
c = (max(y_test) + 1) ## number of categories
y_matrix_test = np.zeros((m,c))
y_matrix_test[(range(m), y_test.astype(int))] = 1

# Normalize training X
x_mean = np.mean(x_train, axis=0)
x_stdev = np.std(x_train, axis=0)
for i, k in enumerate(x_stdev):
if k == 0:
x_stdev[i] = 1
x_train = (x_train - x_mean) / x_stdev
x_test = (x_test - x_mean) / x_stdev

return x_train, x_test, y_train, y_test, y_matrix_train, y_matrix_test

## CIFAR-10

The CIFAR datasets contain data for tiny images (32x32 RGB pixels). The CIFAR-10 dataset has 10 different classifications for these images (e.g., airplane, bird, cat, etc.), and the CIFAR-100 dataset has 100 classes.

For the CIFAR datasets, we need a subroutine to "unpickle" the dictionaries.

def unpickle(file):
"""
Return an "unpickled" file.
:param file: file to be "unpickled"
:return: the unpickled file
"""
import pickle
with open(file, 'rb') as fo:
return dict

For CIFAR-10, training data comes in 5 batches, which will need to be unpickled individually and appended together. The test data is in its own batch. Once we have these unpickled, we can convert the data to numpy.ndarrays and ensure the $$\bf{X}$$, $$\bf{y}$$, $$\bf{Y}$$, for training and testing are of the right dimensions. Finally, we will normalize the training $$\bf{X}$$.

def mlp_CIFAR_10_data_prep():
"""
Return the prepared data using the provided MNIST 1000 dataset.
:return: The x matrix, y vector, and the wide matrix version of y
"""

# Unpickle the data
batch_1 = unpickle("CIFAR10/data_batch_1")
batch_2 = unpickle("CIFAR10/data_batch_2")
batch_3 = unpickle("CIFAR10/data_batch_3")
batch_4 = unpickle("CIFAR10/data_batch_4")
batch_5 = unpickle("CIFAR10/data_batch_5")
test_batch = unpickle("CIFAR10/test_batch")

# Extract X for training
x_train = np.array(batch_1["data"])
x_train = np.append(x_train, batch_2["data"], axis=0)
x_train = np.append(x_train, batch_3["data"], axis=0)
x_train = np.append(x_train, batch_4["data"], axis=0)
x_train = np.append(x_train, batch_5["data"], axis=0)

# Extract y for training
y_train = np.array(batch_1["labels"])
y_train = np.append(y_train, batch_2["labels"])
y_train = np.append(y_train, batch_3["labels"])
y_train = np.append(y_train, batch_4["labels"])
y_train = np.append(y_train, batch_5["labels"])

# create the Y training wide matrix
m = x_train.shape[0]
c = (max(y_train) + 1) ## number of categories
y_matrix_train = np.zeros((m,c))
y_matrix_train[(range(m), y_train.astype(int))] = 1

# extract X and Y for testing
x_test = np.array(test_batch["data"])
y_test = np.array(test_batch["labels"])

# Create Y test wise matrix
m = x_test.shape[0]
c = (max(y_test) + 1) ## number of categories
y_matrix_test = np.zeros((m,c))
y_matrix_test[(range(m), y_test.astype(int))] = 1

# Normalize training X
x_mean = np.mean(x_train, axis=0)
x_stdev = np.std(x_train, axis=0)
for i, k in enumerate(x_stdev):
if k == 0:
x_stdev[i] = 1
x_train = (x_train - x_mean) / x_stdev
x_test = (x_test - x_mean) / x_stdev

return x_train, x_test, y_train, y_test, y_matrix_train, y_matrix_test

## CIFAR-100

The CIFAR-100 dataset is very similar to the CIFAR-10 dataset, only slightly simpler: the data comes in only two batches to be unpickled, the train batch and the test batch.

def mlp_CIFAR_100_data_prep():
"""
Return the prepared data using the provided MNIST 1000 dataset.
:return: The x matrix, y vector, and the wide matrix version of y
"""

# Unpickle the data
train_batch = unpickle("CIFAR100/train")
test_batch = unpickle("CIFAR100/test")

# Extract X for training
x_train = np.array(train_batch["data"])

# Extract y for training
y_train = np.array(train_batch["fine_labels"])

# create the Y training wide matrix
m = x_train.shape[0]
c = (max(y_train) + 1) ## number of categories
y_matrix_train = np.zeros((m,c))
y_matrix_train[(range(m), y_train.astype(int))] = 1

# extract X and Y for testing
x_test = np.array(test_batch["data"])
y_test = np.array(test_batch["fine_labels"])

# Create Y test wise matrix
m = x_test.shape[0]
c = (max(y_test) + 1) ## number of categories
y_matrix_test = np.zeros((m,c))
y_matrix_test[(range(m), y_test.astype(int))] = 1

# Normalize training X
x_mean = np.mean(x_train, axis=0)
x_stdev = np.std(x_train, axis=0)
for i, k in enumerate(x_stdev):
if k == 0:
x_stdev[i] = 1
x_train = (x_train - x_mean) / x_stdev
x_test = (x_test - x_mean) / x_stdev

return x_train, x_test, y_train, y_test, y_matrix_train, y_matrix_test

# Initialization and Running the Model

Initializing the Model. First, we will create a subroutine to initialize our model. The argument layer_widths is a list of layer widths (i.e., number of units in each layer of the model.) The depth of our model is the number of layers, $$\ell$$. It will return a list of lists of parameters and activation functions that define the model. As a reminder, these are:

$\mathbb{W}=\left[ \bf{W}^{(1)}, \bf{W}^{(2)}, ... \bf{W}^{(\ell)} \right]$ $\mathbb{B}=\left[ \bf{b}^{(1)}, \bf{b}^{(2)}, ... \bf{b}^{(\ell)} \right]$ $\varphi=\left[ \phi^{(1)}, \phi^{(2)}, ... \phi^{(\ell)} \right]$

def mlp_initialize(layer_widths, phis):
"""
Use Numpy's random package to initialize a list of weights,
a list of biases, and a list of activation functions for
the number of units per layer provided in the argument.
To pass the tests you will need to initialize the matrices
in the following order:
ws1, bs1, ws2, bs2, ..., wsl, bsl.
:param layer_widths: a list of layer widths
:return: a list of weights, a list of biases, and a list of
phis, one for each layer
"""
l = len(layer_widths)
ws = []
bs = []
for k in range(l - 1):
ws.append(np.random.normal(0, 0.1, (layer_widths[k], layer_widths[k + 1])))
bs.append(np.abs(np.random.normal(0, 0.1, (1, layer_widths[k + 1]))))

# Append softmax for output activation
phis.append(mlp_softmax)
return [ws, bs, phis]

## Running the Model

To run the model, we utilize the routines above, returning x_train, y_matrix_train, y_train, x_test, y_matrix_test, y_test, ws0, bs0, ws_hat, bs_hat, train_err, and test_err.

Calling the routine:

# Set the seed
np.setSeed(1)
x_train, y_matrix_train, y_train, x_test, y_matrix_test, y_test, ws0, bs0, ws_hat, bs_hat, train_err, test_err = mlp_run_...(n_iter, u_layers, phis, alpha, eta_func, eta_details)

Be sure to set the seed for reproducibility of results.

The final routine implementations for using Gradient descent and SGD on the MNIST data are shown below, and the implementations for the CIFAR dataset are excluded only for brevity. The CIFAR implementations are nearly identical - the only changes are the name of the routine and the data preparation called.

def mlp_run_mnist_GD(n_iter, u_layers, phis, alpha, eta_func, eta_details):
"""
Prepare the MNIST data from the local directory and run desired optimization-
based MLP for the specified number of iterations, with the specified
units per layer, learning schedule, and regularization hyperparameter,
to estimate the parameters on the training data.

:param n_iter: the number of iterations of training
:param u_layers: array of units per layer
:param alpha: regularization hyperparameter
:param eta_func: learning rate schedule
:param eta_details: learning rate schedule details
:return: x_train, y_matrix_train, y_train, x_test, y_matrix_test, y_test,
ws0, bs0, ws_hat, bs_hat, train_err, test_err
"""

# Prepare Data
x_train, x_test, y_train, y_test, y_matrix_train, y_matrix_test = mlp_MNIST_data_prep()

# Initialize MLP
ws0, bs0, phis = mlp_initialize(u_layers, phis)

ws_hat, bs_hat = mlp_gradient_descent(x_train, y_matrix_train, ws0, bs0, phis, alpha, n_iter, eta_func, eta_details)

# Calculate Final training and generalization errors
y_hat_train = mlp_predict(x_train, ws_hat, bs_hat, phis)
y_hat_test = mlp_predict(x_test, ws_hat, bs_hat, phis)
train_err = 1-(y_hat_train == y_train).mean()
test_err = 1-(y_hat_test == y_test).mean()

return [x_train, y_matrix_train, y_train, x_test, y_matrix_test, y_test, ws0, bs0, ws_hat, bs_hat, train_err, test_err]

SGD's implementation is nearly identical to Gradient descent, only with the added parameter of the number of minibatches and calling SGD as the optimization technique.

def mlp_run_mnist_SGD(n_iter, minibatches, u_layers, phis, alpha, eta_func, eta_details):
"""
Prepare the MNIST data from the local directory and run desired optimization-
based MLP for the specified number of iterations, with the specified
units per layer, learning schedule, and regularization hyperparameter,
to estimate the parameters on the training data.

:param n_iter: the number of iterations of training
:param minibatches: minibatch size
:param u_layers: array of units per layer
:param alpha: regularization hyperparameter
:param eta_func: learning rate schedule
:param eta_details: learning rate schedule details
:return: x_train, y_matrix_train, y_train, x_test, y_matrix_test, y_test,
ws0, bs0, ws_hat, bs_hat, train_err, test_err
"""

# Prepare Data
x_train, x_test, y_train, y_test, y_matrix_train, y_matrix_test = mlp_MNIST_data_prep()

# Initialize MLP
ws0, bs0, phis = mlp_initialize(u_layers, phis)

# Train with SGD
ws_hat, bs_hat = mlp_SGD(x_train, y_matrix_train, minibatches, ws0, bs0, phis, alpha, n_iter, eta_func, eta_details)

# Calculate Final training and generalization errors
y_hat_train = mlp_predict(x_train, ws_hat, bs_hat, phis)
y_hat_test = mlp_predict(x_test, ws_hat, bs_hat, phis)
train_err = 1-(y_hat_train == y_train).mean()
test_err = 1-(y_hat_test == y_test).mean()

return [x_train, y_matrix_train, y_train, x_test, y_matrix_test, y_test, ws0, bs0, ws_hat, bs_hat, train_err, test_err]

# Results and Discussion

Naturally, since this model is implemented in such a way that it is ran on the CPU of a machine, it is not very fast. This makes it very time consuming to experiment as much as is desired in order to determine an optimal architecture of the MLP for a given problem. This implementation and model is more educational than practical, and in the future, I hope to develop models with CUDA C.

While there was some improvement over iterations of SGD for the CIFAR sets, the results were predictably not great. They could be better with different architectures, but again, this particular implementation does not lend itself easily to discovering better architectures.

For the sake of analyzing the results of this implementation, the architectures of the MLPs for each set of benchmark data are chosen fairly arbitrarily. The results were then used to speculate about the quality of the values of hyperparameters and the general architecture, and several models were ran on each dataset.

All models were ran with a regularization hyperparameter $$\lambda = 0.05$$ with a minibatch size of 32. The learning schedules used with MNIST had hyperparameters of $$\eta_0=0.1$$, $$r=10$$ and $$d=0.891761577179$$ for step-based and $$\eta_0=0.1$$ and $$d=0.011512925465$$ for exponential. This was done such that both schedules begin with the largest constant learning rate ran (0.1) and end with the smallest (0.01).

• The MNIST model
• A single hidden layer with 450 units ran with different activation functions
• Once with ReLU
• Once with hyperbolic tangent
• Ran with multiple learning schedules
• The CIFAR-10 model
• Two different architectures
• The first used a single hidden layer of 1024 units
• The second used a single hidden layer of 784 units
• Ran with multiple constant learning rates
• Could be better with different architecture (not practical with this implementation)
• The CIFAR-100 model
• Ran with a single hidden layer of 1024 units
• Ran with multiple constant learning rates
• Predicatbly performed very poorly
• Could be better with different architecture (not practical with this implementation)

## Comparing Learning Schedules

Though it may appear as though the exponential and constant 0.1 learning rates in the MNIST models below are excluded, they are both so nearly identical to the step-based error that the step-based line covers them. All three are relatively unstable in their descent, causing their training and generalization errors to be unstable as well, likely due to learning rates being too high.

The closeness of the training and sample generalization errors support the value of the regularization hyperparameter chosen.

# Citations

Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.

Krizhevsky, Alex. 2009. “Learning Multiple Layers of Features from Tiny Images,” 32–33. https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf.

Wickham, Hadley. 2016. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. http://ggplot2.org.

Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2018. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.

Original code presented this site is under Creative Commons licensing which makes it possible to re-use this content for non-commercial purposes.