Neural Networks

Caleb O'Neel

Learning objectives

Through completing this project, I will be able to:

  1. Identify key hyperparameters in neural networks and how they can impact model training and fit
  2. Build, tune the parameters of, and apply feed-forward neural networks to data
  3. Implement and explain each and every part of a standard fully-connected neural network and its operation including feed-forward propagation, backpropagation, and gradient descent.
  4. Apply a standard neural network implementation and search the hyperparameter space for optimized application.
  5. Develop a detailed understanding of the math and practical implementation considerations of neural networks, one of the most widely used machine learning tools.

1. Get to know your networks

The goal of this exercise is to better understand some of the key parameters used in neural networks so that you can be better prepared to tune your model. We'll be using the example data and data generation function below throughout this exercise.

The key parameters whose impact we want to explore are: learning rate, batch size, regularization coefficient, and the model architecture (number of layers and number of nodes per layer). We'll explore each of these and determine an optimized configuration of the network for this problem. For all of the settings we'll explore, we'll assume the following default hyperparameters for the model (we'll use scikit-learn's MLPClassifier as our neural network model):
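The exact default values aren't reproduced here; as a rough sketch, a baseline configuration might look something like the following (every specific value below is an assumption, not the assignment's actual defaults):

```python
from sklearn.neural_network import MLPClassifier

# Illustrative baseline configuration (specific values are assumptions,
# not the original assignment's defaults).
default_kwargs = dict(
    hidden_layer_sizes=(30, 30),  # two hidden layers
    activation="relu",
    solver="sgd",
    learning_rate_init=0.01,      # learning rate
    alpha=0.0,                    # L2 regularization coefficient
    batch_size=50,
    max_iter=500,                 # train the same amount for every setting
    early_stopping=False,         # no early stopping (see note below)
    random_state=0,
)
clf = MLPClassifier(**default_kwargs)
```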

You'll notice we're eliminating early stopping so that we train the network the same amount for each setting. This allows us to compare the operation of the neural network while holding the amount of training constant. Typically, the amount of training would be another hyperparameter whose effect on performance we would analyze.

(a) Visualize the impact of different hyperparameter settings. Starting with the default settings above, make the following changes (change only one hyperparameter at a time). For each setting, plot the decision boundary on the training data (since there are three training sets provided, use the first one to train on); a sketch of a plotting helper follows the list below:

  1. Vary the architecture (hidden_layer_sizes) by changing the number of nodes per layer while keeping the number of layers constant at 2: (2,2), (5,5), (30,30)
  2. Vary the learning rate: 0.0001, 0.01, 1
  3. Vary the regularization: 0, 1, 10
  4. Vary the batch size: 5, 50, 500
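A minimal decision-boundary plotting helper could look like the sketch below. The variable names X_train and y_train (the first provided training set) and the default_kwargs baseline from the earlier sketch are assumptions, not names from the original notebook:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_decision_boundary(model, X, y, title=""):
    """Plot a fitted 2-D classifier's decision regions with the training points."""
    x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
    y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 300),
                         np.linspace(y_min, y_max, 300))
    # Predict a class for every point on the grid and shade the regions.
    zz = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    plt.contourf(xx, yy, zz, alpha=0.3, cmap="coolwarm")
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap="coolwarm", edgecolors="k", s=20)
    plt.title(title)
    plt.show()

# Example usage (assumed names): vary nodes per layer, all else at the defaults.
# for sizes in [(2, 2), (5, 5), (30, 30)]:
#     clf = MLPClassifier(**{**default_kwargs, "hidden_layer_sizes": sizes}).fit(X_train, y_train)
#     plot_decision_boundary(clf, X_train, y_train, title=f"hidden_layer_sizes={sizes}")
```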

As you're exploring these settings, visit the Neural Network Playground website, which will let you interactively explore the impact of each of these parameters not only on the model output, but also on a number of other important aspects of neural networks, including learning curves, batch size, and, most importantly, the output of each intermediate neuron, so that you can visualize how neurons interact and combine to form more complex, nonlinear decision boundaries. As you explore, experiment by adding or removing hidden layers and neurons per layer, and vary the learning rate, regularization, and other settings.

(b) Now, with some insight into which settings may work better than others, let's more fully explore the performance of these different settings on our validation dataset. Holding all else constant (at the default settings mentioned above), vary each of the following parameters as specified below. Train your algorithm on the training data and evaluate its performance on the validation dataset (here, overall accuracy is a reasonable performance metric since the classes are balanced and we don't weight one type of error as more important than the other); therefore, use the score method of the MLPClassifier for this. Create a plot of accuracy vs. each parameter you vary (three plots in total); a sketch of one such sweep follows the list below.

  1. Vary learning rate logarithmically from $10^{-5}$ to $10^{0}$ with 20 steps
  2. Vary the regularization parameter logarithmically from $10^{-8}$ to $10^2$ with 20 steps
  3. Vary the batch size over the following values: $[1,3,5,10,20,50,100,250,500]$
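The learning-rate sweep, for example, could be coded along these lines. The names X_train, y_train, X_val, y_val, and the default_kwargs baseline are assumptions carried over from the sketches above:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neural_network import MLPClassifier

# Sweep the learning rate logarithmically from 1e-5 to 1e0 in 20 steps
# (X_train, y_train, X_val, y_val and default_kwargs are assumed names).
learning_rates = np.logspace(-5, 0, 20)
val_scores = []
for lr in learning_rates:
    clf = MLPClassifier(**{**default_kwargs, "learning_rate_init": lr})
    clf.fit(X_train, y_train)
    val_scores.append(clf.score(X_val, y_val))  # overall accuracy

plt.semilogx(learning_rates, val_scores, marker="o")
plt.xlabel("learning rate")
plt.ylabel("validation accuracy")
plt.show()
```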

For each of these cases:

Based upon the results of the models above, I selected the hyperparameters for learning rate, regularization, and batch size that corresponded to the highest model score on the validation data. By selecting the values that performed best on held-out validation data, the model is less likely to be overfit to the training set.

The optimal hyperparameters are:

(c) Next we want to explore the impact of the model architecture but this means varying two parameters instead of one as above. To do this, evaluate the validation accuracy resulting from training the model using each pair of possible numbers of nodes per layer and number of layers from the lists below. We will assume that for any given configuration the number of nodes in each layer is the same (e.g. (2,2,2) and (25,25) are valid, but (2,5,3) is not). Use the optimized values for learning rate, regularization, and batch size selected from section (b).
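A sketch of this two-parameter sweep is below. The candidate lists of node counts and layer counts are assumptions (the assignment's actual lists aren't reproduced here), as is best_kwargs, which stands in for the tuned values from part (b):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neural_network import MLPClassifier

# Assumed candidate values, not the assignment's exact lists.
nodes_per_layer = [2, 5, 10, 15, 25, 30]
n_layers = [1, 2, 3, 4]

scores = np.zeros((len(n_layers), len(nodes_per_layer)))
for i, layers in enumerate(n_layers):
    for j, nodes in enumerate(nodes_per_layer):
        # Same number of nodes in every layer, e.g. (10, 10, 10) for 3 layers.
        clf = MLPClassifier(**{**best_kwargs, "hidden_layer_sizes": (nodes,) * layers})
        clf.fit(X_train, y_train)
        scores[i, j] = clf.score(X_val, y_val)

# Heat map of validation accuracy over the architecture grid.
plt.imshow(scores, origin="lower", aspect="auto", cmap="viridis")
plt.xticks(range(len(nodes_per_layer)), nodes_per_layer)
plt.yticks(range(len(n_layers)), n_layers)
plt.xlabel("nodes per layer")
plt.ylabel("number of layers")
plt.colorbar(label="validation accuracy")
plt.show()
```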

After testing and plotting the scores on this heat map, I believe that 10 nodes and 4 layers yield the best results. Although this is a larger architecture than I would ideally want, it receives the best score by some margin compared to the models that are smaller than it. The configurations with similar scores are all larger, making this the best option from the available data.

(d) Based on the optimal choice of hyperparameters, train your model with your optimized hyperparameters on all the training data (all three sets) AND the validation data (this is provided as X_train_plus_val and y_train_plus_val).
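As a sketch, the refit might look like the following, where best_kwargs is assumed to hold the tuned learning rate, regularization, and batch size from part (b), and the (10, 10, 10, 10) architecture reflects the part (c) selection above:

```python
from sklearn.neural_network import MLPClassifier

# Refit on the combined training + validation data with the tuned settings
# (best_kwargs and the 4x10 architecture are assumptions based on parts (b)-(c)).
final_clf = MLPClassifier(**{**best_kwargs, "hidden_layer_sizes": (10,) * 4})
final_clf.fit(X_train_plus_val, y_train_plus_val)
```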

(e) Automated hyperparameter search. The manual, greedy approach (setting one or two parameters at a time while holding the rest constant) provides good insight into how the neural network hyperparameters impact model fitting for this particular training process. However, it limits our ability to search the hyperparameter space more deeply. Now we'll use a scikit-learn tool to search our hyperparameter space. Use RandomizedSearchCV to select the hyperparameters by training on ALL of the training and validation data (it will perform cross validation internally). You can use this example as a template for how to do this. Grid search tries every combination of the candidate values; doing this over a large hyperparameter space for a model that takes a while to run is intractable. Random search has been shown to be surprisingly effective in these situations at identifying excellent hyperparameter combinations.

- Set the number of iterations to at least 25 (you'll evaluate at least 25 random combinations of possible parameter values). You can go as high as you want, but the larger this value is, the longer the search will take.

- If you run this on Colab or any system with multiple cores, set the parameter n_jobs to -1 to use all available cores for more efficient training through parallelization.

- You'll need to set the range or distribution of the parameters you want to sample from. Search over the same ranges as in the previous problems (except this time, you'll search over all the parameters at once). You can use a list of values for batch_size, loguniform distributions for the learning rate and regularization parameter, and a list of tuples for the hidden_layer_sizes parameter.

- Once the model is fit, use the best_params_ attribute to extract the optimized values of the parameters.

- State the accuracy of the model on the test dataset.

- Plot the ROC curve for your best model from the greedy hyperparameter selection vs. the model identified through random search. In the legend of the plot, report the AUC for each curve.

- Plot the final decision boundary for the greedy and random-search-based classifiers along with one of the training datasets to demonstrate the shape of the final boundary.
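One way the search setup could look is sketched below; the candidate hidden_layer_sizes list and the fixed max_iter are illustrative assumptions, while X_train_plus_val and y_train_plus_val are the combined data described above:

```python
from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neural_network import MLPClassifier

# Distributions mirror the earlier sweeps; the hidden_layer_sizes candidates
# are an illustrative subset, not the assignment's exact list.
param_distributions = {
    "learning_rate_init": loguniform(1e-5, 1e0),
    "alpha": loguniform(1e-8, 1e2),
    "batch_size": [1, 3, 5, 10, 20, 50, 100, 250, 500],
    "hidden_layer_sizes": [(2, 2), (5, 5), (10, 10, 10), (15,) * 4, (30, 30)],
}

search = RandomizedSearchCV(
    MLPClassifier(max_iter=500, early_stopping=False, random_state=0),
    param_distributions,
    n_iter=25,    # at least 25 random combinations
    n_jobs=-1,    # use all available cores
    random_state=0,
)
search.fit(X_train_plus_val, y_train_plus_val)
print(search.best_params_)   # optimized hyperparameter values
```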

How did the performance compare?

The neural net created in part (d) used the highest-scoring hyperparameters found in part (c). This model received an AUC of 0.73. The neural net built from the best parameters found by RandomizedSearchCV also had an AUC of 0.73. The parameters of these two models did vary, most notably in their node and layer counts: the model from part (d) had 4 layers of 15 nodes, while the model from the random search had 2 layers of 5 nodes.
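For reference, an ROC/AUC comparison like the one described above could be produced along these lines; greedy_clf, random_clf, X_test, and y_test are assumed names for the two fitted models and the held-out test set, and a recent scikit-learn (with RocCurveDisplay) is assumed:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import RocCurveDisplay

# Overlay both ROC curves on one set of axes; AUC appears in the legend.
ax = plt.gca()
RocCurveDisplay.from_estimator(greedy_clf, X_test, y_test, name="greedy search", ax=ax)
RocCurveDisplay.from_estimator(random_clf, X_test, y_test, name="random search", ax=ax)
plt.show()
```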

My final model from part (d) has an accuracy of 0.73, and the random search model has an accuracy of 0.72. This indicates the part (d) model was slightly better in this regard, but the difference is negligible and likely within run-to-run randomness.

This indicates that the random search arrived at results similar to our more iterative, greedy tuning process. Had we employed a grid search, I suspect the results would have been the same, or better only by a negligible margin.