Supervised Learning: model training and evaluation

Caleb O'Neel

Learning Objectives:

This project provides structured practice to help you:

  1. Understand the primary workflow in machine learning: (1) identifying a hypothesis set of models, (2) determining a loss/cost/error/objective function to minimize, and (3) minimizing that function through gradient descent
  2. Understand the inner workings of logistic regression and how linear models for classification can be developed.
  3. Gain practice in implementing machine learning algorithms from the most basic building blocks, understanding the math and programming behind them to achieve practical proficiency with the techniques
  4. Implement batch gradient descent and become familiar with how that technique is used and its dependence on the choice of learning rate
  5. Evaluate supervised learning algorithm performance through ROC curves and using cross validation
  6. Develop an understanding of the optimal minimum-misclassification-error classifier (the Bayes' classifier)

1. [50 points] Classification using logistic regression: build it from the ground up

I. Load, prepare, and plot your data

You are given some data for which you are tasked with constructing a classifier. The first step when facing any machine learning project: look at your data!

(a) Load the data.

(b) Do the data require any preprocessing due to missing values, scale differences, etc.? If so, how did you handle these issues?

Next, we walk through our key steps for model fitting: choose a hypothesis set of models to train (in this case, logistic regression); identify a cost function to measure the model fit to our training data; optimize model parameters to minimize cost (in this case using gradient descent). Once we've completed model fitting, we will evaluate the performance of our model and compare performance to another approach (a KNN classifier).

This is a classification problem, so right away we know a logistic rather than a linear regression would be more appropriate if we want to use some form of regression. The data appear fairly well segmented, and I expect that a logistic regression classifier will have a lot of success distinguishing between the two classes. There are several observations that do not conform to the general location of the other observations in their class, so I do not expect a near-perfect classification rate. That said, there does appear to be a general trend to the data, and there are enough observations that a logistic regression would be a good choice for finding this decision boundary.

b.

Looking at the data, there are no missing values, but there is a scale issue: x1 and x2 have fairly large differences in their mean and standard deviation values. To deal with this I will standardize both columns by subtracting the mean and dividing by the standard deviation.
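A minimal sketch of this standardization (the function name is an illustrative choice):

```python
import numpy as np

def standardize(X):
    """Standardize each column: subtract its mean, divide by its standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Example usage on a toy feature matrix (columns: x1, x2).
X = np.array([[1.0, 200.0], [2.0, 240.0], [3.0, 280.0]])
X_std = standardize(X)
print(X_std.mean(axis=0), X_std.std(axis=0))  # approximately 0 and 1 per column
```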

II. Stating the hypothesis set of models to evaluate (we'll use logistic regression)

Given that our data consists of two features, our logistic regression problem will be applied to a two-dimensional feature space. Recall that our logistic regression model is:

$$f(\mathbf{x}_i,\mathbf{w})=\sigma(\mathbf{w}^{\top} \mathbf{x}_i)$$

where the sigmoid function is defined as $\sigma(x) = \dfrac{e^x}{1+e^{x}}= \dfrac{1}{1+e^{-x}}$. Also, since this is a two-dimensional problem, we define $\mathbf{w}^{\top} \mathbf{x}_i = w_0 x_{i,0} + w_1 x_{i,1} + w_2 x_{i,2}$ and here, $\mathbf{x}_i=[x_{i,0}, x_{i,1}, x_{i,2}]^{\top}$, and $x_{i,0} \triangleq 1$

Remember from class that we interpret our logistic regression classifier output (or confidence score) as the conditional probability that the target variable for a given sample $y_i$ is from class "1", given the observed features, $\mathbf{x}_i$. For one sample, $(y_i, \mathbf{x}_i)$, this is given as:

$$P(Y=1|X=\mathbf{x}_i) = f(\mathbf{x}_i,\mathbf{w})=\sigma(\mathbf{w}^{\top} \mathbf{x}_i)$$

In the context of maximizing the likelihood of our parameters given the data, we define this to be the likelihood function $L(\mathbf{w}|y_i,\mathbf{x}_i)$, corresponding to one sample observation from the training dataset.

Aside: the careful reader will recognize this expression looks different from when we talk about the likelihood of our data given the true class label, typically expressed as $P(x|y)$, or the posterior probability of a class label given our data, typically expressed as $P(y|x)$. In the context of training a logistic regression model, the likelihood we are interested in is the likelihood function of our logistic regression parameters, $\mathbf{w}$. It's our goal to use this to choose the parameters to maximize the likelihood function.

III. Find the cost function that we can use to choose the model parameters, $\mathbf{w}$, that best fit the training data.

(c) What is the likelihood function that corresponds to all the $N$ samples in our training dataset that we will wish to maximize? Unlike the likelihood function written above which gives the likelihood function for a single training data pair $(y_i, \mathbf{x}_i)$, this question asks for the likelihood function for the entire training dataset $\{(y_1, \mathbf{x}_1), (y_2, \mathbf{x}_2), ..., (y_N, \mathbf{x}_N)\}$.

(d) Since a logarithm is a monotonic function, maximizing $f(x)$ is equivalent to maximizing $\ln [f(x)]$. Express the likelihood from the last question as a cost function of the model parameters, $C(\mathbf{w})$; that is, the negative of the logarithm of the likelihood.

(e) Calculate the gradient of the cost function with respect to the model parameters $\nabla_{\mathbf{w}}C(\mathbf{w})$. Express this in terms of the partial derivatives of the cost function with respect to each of the parameters, e.g. $\nabla_{\mathbf{w}}C(\mathbf{w}) = \left[\dfrac{\partial C}{\partial w_0}, \dfrac{\partial C}{\partial w_1}, \dfrac{\partial C}{\partial w_2}\right]$.

To simplify notation, please use $\mathbf{w}^{\top}\mathbf{x}$ instead of writing out $w_0 x_{i,0} + w_1 x_{i,1} + w_2 x_{i,2}$ each time it appears (where $x_{i,0} = 1$ for all $i$). You are also welcome to use $\sigma()$ to represent the sigmoid function. Lastly, this will be a function of the features, $x_{i,j}$ (with the first index in the subscript representing the observation and the second the feature); the targets, $y_i$; and the logistic regression model parameters, $w_j$.

(f) Write out the gradient descent update equation. This should clearly express how to update each weight from one step in gradient descent $w_j^{(k)}$ to the next $w_j^{(k+1)}$. There should be one equation for each logistic regression model parameter (or you can represent it in vectorized form). Assume that $\eta$ represents the learning rate.

c.

Since each $y_i$ is an independent Bernoulli outcome with success probability $\sigma(\mathbf{w}^{\top}\mathbf{x}_i)$, the likelihood of the full training dataset is the product of the per-sample likelihoods:

$$ L(\mathbf{w}|\mathbf{y},X) = \prod_{i = 1}^{N} P(y_{i}|\mathbf{x}_{i},\mathbf{w}) = \prod_{i = 1}^{N} \sigma(\mathbf{w}^{\top}\mathbf{x}_{i})^{y_{i}}\left(1-\sigma(\mathbf{w}^{\top}\mathbf{x}_{i})\right)^{1-y_{i}} $$

Where: $$\sigma(\mathbf{w}^{\top}\mathbf{x}_{i}) = \frac{1}{1+e^{-\mathbf{w}^{\top}\mathbf{x}_{i}}}$$

d.

$$ \ln L(\mathbf{w}|\mathbf{y},X) = \sum_{i=1}^{N} \left[ y_{i}\ln(\hat{y}_{i}) + (1-y_{i})\ln(1-\hat{y}_{i}) \right] $$

$$ C(\mathbf{w}) = -\ln L(\mathbf{w}|\mathbf{y},X) = -\sum_{i=1}^{N} \left[ y_{i}\ln(\hat{y}_{i}) + (1-y_{i})\ln(1-\hat{y}_{i}) \right] $$

In this case, $\hat{y}_{i}$ is our predicted value, the output of the sigmoid function:

$$ \hat{y}_{i} = \sigma(\mathbf{w}^{\top}\mathbf{x}_{i}) = \frac{1}{1+e^{-\mathbf{w}^{\top}\mathbf{x}_{i}}} $$

e.

$$ C(\mathbf{w}) = -\sum_{i=1}^{N} \left[ y_{i}\ln(\hat{y}_{i}) + (1-y_{i})\ln(1-\hat{y}_{i}) \right], \qquad \hat{y}_{i} = \sigma(\mathbf{w}^{\top}\mathbf{x}_{i}) $$

Applying the chain rule with the identity $\sigma'(z) = \sigma(z)(1-\sigma(z))$, the logarithm and sigmoid derivatives cancel so that each partial derivative reduces to the prediction error times the corresponding feature:

$$ \nabla_{\mathbf{w}}C(\mathbf{w}) = \left[\dfrac{\partial C}{\partial w_0}, \dfrac{\partial C}{\partial w_1}, \dfrac{\partial C}{\partial w_2}\right] $$

$$ \dfrac{\partial C}{\partial w_0} = \sum_{i=1}^N \left(\sigma(\mathbf{w}^{\top}\mathbf{x}_{i}) - y_i\right) \cdot 1 $$

$$ \dfrac{\partial C}{\partial w_1} = \sum_{i=1}^N \left(\sigma(\mathbf{w}^{\top}\mathbf{x}_{i}) - y_i\right) x_{i,1} $$

$$ \dfrac{\partial C}{\partial w_2} = \sum_{i=1}^N \left(\sigma(\mathbf{w}^{\top}\mathbf{x}_{i}) - y_i\right) x_{i,2} $$

f.

$$ w_j^{(k+1)} = w_j^{(k)} - \eta \dfrac{\partial C}{\partial w_{j}} $$

$$ w_j^{(k+1)} = w_j^{(k)} - \eta \sum_{i=1}^N \left(\sigma(\mathbf{w}^{\top}\mathbf{x}_{i}) - y_i\right) x_{i,j} $$

where $j$ indexes the model parameters (here $j \in \{0, 1, 2\}$, with $x_{i,0} = 1$ so that $w_0$ is the intercept). Note the minus sign: gradient descent moves against the gradient to reduce the cost.

IV. Implement gradient descent and your logistic regression algorithm

(g) Implement your logistic regression model.

This structure is strongly encouraged; however, you're welcome to adjust this to your needs (adding helper methods, modifying parameters, etc.).
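A minimal sketch of one possible implementation with batch gradient descent (the class name, parameters, and stopping rule here are illustrative choices, not a required interface):

```python
import numpy as np

class LogisticRegressionGD:
    """Logistic regression trained with batch gradient descent."""

    def __init__(self, learning_rate=0.01, max_iter=10000, tol=1e-6):
        self.learning_rate = learning_rate
        self.max_iter = max_iter
        self.tol = tol          # stop once the largest weight update is this small
        self.w = None
        self.cost_history = []  # cost at each iteration, for plotting

    @staticmethod
    def _sigmoid(z):
        # Clip z to avoid overflow in exp() for extreme arguments.
        return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

    def _add_intercept(self, X):
        # Prepend a column of ones so w[0] plays the role of w_0 (x_{i,0} = 1).
        return np.column_stack([np.ones(len(X)), X])

    def _cost(self, Xb, y):
        # Negative log-likelihood (cross-entropy); eps guards against log(0).
        eps = 1e-12
        y_hat = self._sigmoid(Xb @ self.w)
        return -np.sum(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))

    def fit(self, X, y):
        Xb = self._add_intercept(np.asarray(X, dtype=float))
        y = np.asarray(y, dtype=float)
        self.w = np.zeros(Xb.shape[1])
        for _ in range(self.max_iter):
            grad = Xb.T @ (self._sigmoid(Xb @ self.w) - y)  # dC/dw, vectorized
            step = self.learning_rate * grad
            self.w -= step  # descend: move against the gradient
            self.cost_history.append(self._cost(Xb, y))
            if np.max(np.abs(step)) < self.tol:
                break
        return self

    def predict_proba(self, X):
        return self._sigmoid(self._add_intercept(np.asarray(X, dtype=float)) @ self.w)

    def predict(self, X, threshold=0.5):
        return (self.predict_proba(X) >= threshold).astype(int)
```

The gradient line is the vectorized form of the update equation from part (f): `Xb.T @ (sigmoid(Xb @ w) - y)` stacks $\sum_i (\sigma(\mathbf{w}^{\top}\mathbf{x}_i) - y_i)\, x_{i,j}$ for every $j$ at once.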

h.

The learning rate determines how quickly the model adjusts the parameter weights. As shown in the graph above, the larger the learning rate, the faster the cost dropped towards 0. When the learning rate was 1, the cost function dropped close to 0 within the first few steps. This is reinforced by the output of the first 50 weights printed above, where the first handful of steps produce large changes in the weights before rapidly leveling off. Returning to the graph, the learning rate of $10^{-2}$ was the second fastest to approach 0, followed by $10^{-4}$, while the model with a rate of $10^{-6}$ made little progress because the minimum step-size threshold was reached almost immediately. A larger learning rate may help you arrive at the optimized weights faster, but if the learning rate is too large, you risk overshooting the minimum and never descending to the optimized cost and weights.

For the purposes of this data, I think a learning rate of $10^{-2}$ is best: it quickly minimizes the cost, but it does not risk being too large to optimize the weights effectively.
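A sketch of how the learning-rate comparison above could be generated, reusing the hypothetical LogisticRegressionGD class from the previous sketch (X_train and y_train are assumed to be the standardized features and binary labels):

```python
import matplotlib.pyplot as plt

# Train one model per learning rate and record the cost trajectory of each.
for lr in [1, 1e-2, 1e-4, 1e-6]:
    model = LogisticRegressionGD(learning_rate=lr, max_iter=5000)
    model.fit(X_train, y_train)
    plt.plot(model.cost_history, label=f"learning rate = {lr}")

plt.xlabel("Gradient descent iteration")
plt.ylabel("Cost (negative log-likelihood)")
plt.legend()
plt.show()
```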

V. Evaluate your model performance through cross validation

(i) Test the performance of your trained classifier using the K-folds cross validation resampling technique. The scikit-learn class StratifiedKFold may be helpful.

The purpose of k-fold cross validation is to get a more accurate sense of what the test error will be. Often a model only has access to the training data, and training error is not a reliable predictor of the test error rate; relying on it alone can mask over- or underfitting. One workaround is to split your training data into training and validation sets, but this limits the amount of data available to train your model. K-fold cross validation offers a partial solution: instead of a single split, you divide the data into k folds (in this case 10), train the model on all folds except the one being withheld, and then use the withheld fold as the test set. This process is repeated across all folds, which lets you train your model and fine-tune parameters while getting an idea of what the test error could be.
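A sketch of the resampling loop, assuming the LogisticRegressionGD class from earlier and NumPy arrays X (standardized features) and y (binary labels):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
fold_accuracies = []

for train_idx, test_idx in skf.split(X, y):
    # Retrain from scratch on the 9 training folds, score on the held-out fold.
    model = LogisticRegressionGD(learning_rate=1e-2)
    model.fit(X[train_idx], y[train_idx])
    preds = model.predict(X[test_idx])
    fold_accuracies.append(np.mean(preds == y[test_idx]))

print(f"Mean CV accuracy: {np.mean(fold_accuracies):.3f}")
```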

I could not get my ROC curve working, unfortunately. However, I would expect the models to perform similarly on this dataset given the data distribution and shape.
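For reference, a sketch of how the ROC curve could be computed with scikit-learn, assuming labels y_true and matching confidence scores y_score accumulated from the held-out folds above:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

fpr, tpr, _ = roc_curve(y_true, y_score)  # sweep the decision threshold
plt.plot(fpr, tpr, label=f"AUC = {auc(fpr, tpr):.3f}")
plt.plot([0, 1], [0, 1], linestyle="--", label="chance")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```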


2. [30 points] Digits classification

An exploration of regularization, imbalanced classes, ROC and PR curves

Load your dataset from the MNIST dataset of handwritten digits, using the code provided below. MNIST has a training set of 60,000 examples, and a test set of 10,000 examples. The digits have been size-normalized and centered in a fixed-size image.

Your goal is to classify whether or not an example digit is a 3. Your binary classifier should predict $y=1$ if the digit is a 3, and $y=0$ otherwise. Create your dataset by transforming your labels into a binary format (3's are class 1, and all other digits are class 0).
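A minimal sketch of the label transformation, assuming integer digit labels y_train and y_test are already loaded (e.g. via sklearn.datasets.fetch_openml('mnist_784'), whose labels come back as strings):

```python
import numpy as np

# Binary target: 1 if the digit is a 3, 0 otherwise.
y_train_binary = (np.asarray(y_train).astype(int) == 3).astype(int)
y_test_binary = (np.asarray(y_test).astype(int) == 3).astype(int)
```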

(a) Plot 10 examples of each class (i.e. class $y=0$, which are not 3's and class $y=1$ which are 3's), from the training dataset.

(b) How many examples are present in each class? Show a histogram of samples by class. What fraction of samples are positive? What issues might this cause?

(c) Using a logistic regression classifier, apply lasso regularization, retrain the model, and evaluate its performance over a range of values of the regularization coefficient. You can implement this using the LogisticRegression module and activating the 'l1' penalty; the parameter $C$ is the inverse of the regularization strength. Vary the value of C logarithmically from $10^{-4}$ to $10^4$ (and make your x-axes logarithmic in scale), evaluating it at 20 different values of C. As you vary the regularization coefficient, plot the model's performance.

(d) Train and test a (1) logistic regression classifier with minimal regularization (using the Scikit Learn package, set penalty='l1', C=1e100 to approximate this), (2) a logistic regression classifier with the best value of the regularization parameter from the last section, (3) a Linear Discriminant Analysis (LDA) Classifier, and (4) a Random Forest (RF) classifier (using default parameters for the LDA and RF classifiers).

ANSWER

a.

b.
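A sketch of the class-balance check, assuming the binary labels from the sketch above:

```python
import numpy as np
import matplotlib.pyplot as plt

# Count training examples per binary class and plot the histogram.
values, counts = np.unique(y_train_binary, return_counts=True)
plt.bar(values, counts, tick_label=["not a 3 (y=0)", "3 (y=1)"])
plt.ylabel("Number of training examples")
plt.show()

print(f"Fraction of positive samples: {counts[1] / counts.sum():.4f}")
```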

c.
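A sketch of the regularization sweep, assuming X_train, X_test and the binary labels from the sketches above:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

cs = np.logspace(-4, 4, 20)  # 20 values of C from 1e-4 to 1e4
test_accuracy = []

for c in cs:
    # liblinear supports the 'l1' penalty; C is the inverse regularization strength.
    clf = LogisticRegression(penalty="l1", C=c, solver="liblinear", max_iter=1000)
    clf.fit(X_train, y_train_binary)
    test_accuracy.append(clf.score(X_test, y_test_binary))

plt.semilogx(cs, test_accuracy)
plt.xlabel("C (inverse of regularization strength)")
plt.ylabel("Test accuracy")
plt.show()
```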

d.

In this section I created four different models to classify the data. The first was a lasso logistic regression with minimal regularization. The second was also a logistic regression, but using the best regularization value of C from the last section, which I judged to be $10^{-2}$. The third was a Linear Discriminant Analysis classifier, and the final model was a Random Forest classifier. After training and fitting these models and using them to create predictions, I plotted their ROC and Precision-Recall curves, which can be seen above.
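A sketch of how the four models could be fit and scored, under the same assumptions as the sketches above:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, auc

models = {
    "LR, minimal regularization": LogisticRegression(penalty="l1", C=1e100, solver="liblinear", max_iter=1000),
    "LR, C = 1e-2": LogisticRegression(penalty="l1", C=1e-2, solver="liblinear", max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
    "Random Forest": RandomForestClassifier(),
}

for name, model in models.items():
    model.fit(X_train, y_train_binary)
    scores = model.predict_proba(X_test)[:, 1]  # confidence that the digit is a 3
    fpr, tpr, _ = roc_curve(y_test_binary, scores)
    print(f"{name}: AUC = {auc(fpr, tpr):.2f}")
```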

A ROC (Receiver Operating Characteristic) curve is used as an indicator of a model's accuracy, with the true positive rate on the y-axis and the false positive rate on the x-axis. The best models will have a true positive rate close to 1 and a false positive rate close to 0, meaning that the further to the upper left the curve is, the better the model. The AUC (Area Under the Curve) is the area beneath and to the right of the ROC curve, and thus the larger the area under the curve, the more accurate the model is deemed to be.

A Precision-Recall curve shares the same general concepts as a ROC curve, but the precision score is on the y-axis and the recall score is on the x-axis. Precision is calculated by dividing the total true positives by the sum of true positives and false positives. Recall is calculated by dividing the total true positives by the sum of true positives and false negatives. We want to maximize both the precision and recall scores, so the best PR curves sit in the upper right of the graph.
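In symbols, with $TP$, $FP$, and $FN$ denoting the counts of true positives, false positives, and false negatives:

$$ \text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN} $$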

For the AUC scores, the two Logistic Regression classifiers performed almost identically, with both achieving an Area Under the Curve score of 0.98. The Linear Discriminant Analysis classifier was not far behind with an AUC of 0.97. The Random Forest Classifier performed the worst with an AUC of 0.94.

We saw the same general trends reflected in the PR curve, with the two logistic regression models performing best, LDA not far behind, and the random forest classifier performing worst. In this graph, the second logistic regression model, with a regularization value of $10^{-2}$, appears to have performed slightly better than the default model with minimal regularization.

Because of its slightly better performance on the PR curve, I believe the second lasso logistic regression model was the best of these options, and I would choose to apply it to an unseen dataset because it appears to perform best on test data. However, the difference in performance from the logistic regression model with minimal regularization is very slight.

3. [20 points] Comparing the Bayes' decision rule with logistic regression

The phrase "Bayes' decision rule" is often used to describe a classifier decision rule that minimizes misclassification rate (equally penalizing false positives and false negatives) for a given problem. In this exercise you will first determine the Bayes' decision rule for a binary classification problem where you know the likelihood of data from each class.

This binary classification problem has two target classes with data distributed as exponential random variables:

$$P(x|C_i) = \lambda_i e^{-\lambda_i x}$$

Where $C_i$ represents the class from which the sample is drawn (0 or 1). This is known as the class-conditional likelihood, not surprisingly because it is the likelihood of the data conditioned on knowing what class it came from. This represents two separate density functions: one for the case when the class is 0 ($P(x|C_0)$) and one for when the class is 1 ($P(x|C_1)$). Assume that we know that $\lambda_0 = 5$ and $\lambda_1 = 1$ to fully-specify those distributions.

(a) Plot the probability density of each class-conditional distribution (i.e. the likelihood function), $P(x|C_0)$ and $P(x|C_1)$, on the same plot in the domain $x \in [0,2]$. You can use scipy's expon module for this. Note that the scale parameter for this module is defined as $1/\lambda$.

(b) Assuming the prior class distributions are $P(C_0)=P(C_1)=0.5$, determine the Bayes' decision rule using the information above. Remember that the Bayes Decision rule can be defined using the posterior distributions of the data; when $P(Y=1|x)>P(Y=0|x)$, predict Class 1, otherwise predict Class 0. In that way, you will assign the most probable class to the data based on the value of $x$. The decision rule will then be of the form:

If $x > x^*$, then predict Class 1, otherwise predict Class 0

Determine the value $x^*$ that minimizes misclassification (equally penalizing false positives and false negatives, and no penalty/reward for correct detections). Show your work in deriving this value.

(c) How does your answer in (b) relate to the plot you made in (a)?

(d) What if instead, $P(C_1)=2P(C_0)$; what would the new value of $x^*$ be in this case? Before computing the value, think through how you would expect it to change, then see if the math supports your conclusion.

(e) Load the test data in the file A3_Q3_test.csv, which follows the distributions above. Apply your Bayes decision rule to the data. What is the misclassification rate (error rate, or fraction of misclassified samples) of this decision rule? This should be the best that any algorithm could achieve (on average).

(f) Load the training data in the file A3_Q3_train.csv (which follows the distributions above) and train a logistic regression classifier on the data (using default parameters) from Scikit-Learn's LogisticRegression module. What is your misclassification error for your test dataset? How does this compare with the Bayes' classifier?

(g) What is your decision rule for the logistic regression model you just trained? To compute this, extract the parameters from your fit model (look for the coef_ and intercept_ attributes) and since the classes are balanced, the decision rule will be to classify a sample $x$ as Class 1 when your logistic regression sigmoid is greater than 0.5 (the halfway point from the two extremes of 0 and 1), since we assume $P(C_1|x)=\sigma(w_0 + w_1 x)$ in logistic regression. How does the decision rule from logistic regression compare with the Bayes' classifier?

ANSWER

a.
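A sketch of this plot using scipy's expon module, as the question suggests (recall that scale $= 1/\lambda$):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import expon

x = np.linspace(0, 2, 500)
# Class-conditional likelihoods with lambda_0 = 5 and lambda_1 = 1.
plt.plot(x, expon.pdf(x, scale=1/5), label=r"$P(x|C_0)$, $\lambda_0 = 5$")
plt.plot(x, expon.pdf(x, scale=1/1), label=r"$P(x|C_1)$, $\lambda_1 = 1$")
plt.xlabel("x")
plt.ylabel("Class-conditional likelihood")
plt.legend()
plt.show()
```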

b.

$$ f_{k}(x) = P(X = x|Y = k) $$
$$ P(Y = k|X = x) = \frac{\pi_{k}f_{k}(x)}{\sum_{l=1}^{K}\pi_{l}f_{l}(x)} $$

With $P(x|C_i) = \lambda_i e^{-\lambda_i x}$, $\lambda_0 = 5$, $\lambda_1 = 1$, and equal priors $\pi_0 = \pi_1 = 0.5$, the boundary $x^{*}$ is where the two posteriors are equal. Since the denominators are identical, this reduces to equating the prior-weighted likelihoods:

$$ \frac{0.5\, e^{-x}}{0.5\, e^{-x} + 0.5 \cdot 5 e^{-5x}} = \frac{0.5 \cdot 5 e^{-5x}}{0.5\, e^{-x} + 0.5 \cdot 5 e^{-5x}} $$
$$ e^{-x} = 5 e^{-5x} $$
$$ -x = \ln 5 - 5x $$
$$ x^{*} = \frac{\ln 5}{4} \approx 0.4024 $$

c.

The decision boundary signifies the point where the more probable class for an observation switches from one to the other. The graph in part (a) shows the respective likelihoods of the two classes as a function of x, and the value calculated in part (b) is the x at which the likelihoods intersect, so that both classes are equally probable for the observation. For any x less than 0.4024, the class with $\lambda = 5$ has the higher likelihood of being the correct class; for values greater than 0.4024, the class with $\lambda = 1$ has the higher probability of being the correct class.

d.

If $P(C_{1}) = 2P(C_{0})$, class 1 becomes more probable a priori, so the classifier should assign class 1 more often. Since class 1 is predicted when $x > x^{*}$, I expect the decision boundary $x^{*}$ to decrease. My derivation of the new decision boundary can be seen below; setting the posteriors equal and cancelling the shared denominator $P(x)$ and the common factor $P(C_{0})$:

$$ P(x|C_{1})\,P(C_{1}) = P(x|C_{0})\,P(C_{0}) $$
$$ 2\,\lambda_{1} e^{-\lambda_{1} x} = \lambda_{0} e^{-\lambda_{0} x} $$
$$ 2 e^{-x} = 5 e^{-5x} $$
$$ \ln 2 - x = \ln 5 - 5x $$
$$ 4x = \ln\left(\frac{5}{2}\right) $$
$$ x^{*} = \frac{\ln(5/2)}{4} \approx 0.229 $$

The math supports this expectation: doubling the prior on class 1 lowers the decision boundary from 0.4024 to approximately 0.229, enlarging the region in which class 1 is predicted.

e.

The misclassification rate of the Bayes classifier I calculated above is 0.23395, or 23.395%.
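A sketch of how this rate could be computed, assuming A3_Q3_test.csv contains a feature column x and a label column y (hypothetical column names):

```python
import numpy as np
import pandas as pd

test = pd.read_csv("A3_Q3_test.csv")
x_star = np.log(5) / 4  # Bayes decision boundary from part (b)

preds = (test["x"] > x_star).astype(int)  # predict class 1 when x > x*
error_rate = np.mean(preds != test["y"])
print(f"Misclassification rate: {error_rate:.5f}")
```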

f.

The misclassification rate of this model is 0.234, which is just barely larger than the Bayes classification error I calculated in part (e). The Bayes classifier should hypothetically be the best you can do, so having a misclassification rate so similar to it signifies that logistic regression is a good model for this data.

g.

The decision boundary for the model is at 0.4034, which is extremely close to the 0.4024 I calculated for the Bayes decision boundary in part (b). This explains why the misclassification rates are virtually identical, and since the Bayes decision boundary is the best we can do, it suggests that logistic regression is a good choice for this data.
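A sketch of how this boundary could be extracted from a fitted scikit-learn model clf (assumed trained on the single feature x): the sigmoid crosses 0.5 exactly where $w_0 + w_1 x = 0$, i.e. at $x^{*} = -w_0 / w_1$.

```python
w0 = clf.intercept_[0]  # fitted intercept, w_0
w1 = clf.coef_[0][0]    # fitted slope, w_1
x_star_lr = -w0 / w1    # sigma(w0 + w1*x) = 0.5 when w0 + w1*x = 0
print(f"Logistic regression decision boundary: {x_star_lr:.4f}")
```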