This project provides structured practice to help...
You are given some data for which you are tasked with constructing a classifier. The first step when facing any machine learning project: look at your data!
(a) Load the data from A3_Q1_data.csv. This file contains the binary class labels, $y$, and the features $x_1$ and $x_2$.
(b) Do the data require any preprocessing due to missing values, scale differences, etc.? If so, how did you handle these issues?
Next, we walk through our key steps for model fitting: choose a hypothesis set of models to train (in this case, logistic regression); identify a cost function to measure the model fit to our training data; optimize model parameters to minimize cost (in this case using gradient descent). Once we've completed model fitting, we will evaluate the performance of our model and compare performance to another approach (a KNN classifier).
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import statistics as stats
# Import Data
lrdata = pd.read_csv('./Data/A3_Q1_data.csv')
# Format and split the data
#xarr = lrdata.iloc[:,0:2].values
#yarr = lrdata.iloc[:,2].values
xarr = lrdata.values[:,0:2]
yarr = lrdata.values[:,2]
#xarr[:,0] = (xarr[:,0] - xarr[:,0].mean())/stats.stdev(xarr[:,0])
#xarr[:,1] = (xarr[:,1] - xarr[:,1].mean())/stats.stdev(xarr[:,1])
X_train, X_test, y_train, y_test = train_test_split(xarr, yarr, test_size=0.30, shuffle = True, random_state=42)
# Plot the data in a scatterplot
x1 = X_train[:,0]
x2 = X_train[:,1]
fig, ax = plt.subplots()
scatter = plt.scatter(x1, x2, c = y_train)
plt.xlabel('x1 Observation')
plt.ylabel('x2 Observation')
plt.title('Training Data Scatterplot')
legend = ax.legend(*scatter.legend_elements(), loc="lower right", title="Classes")
ax.add_artist(legend)
This is a classification problem, so right away we know logistic rather than linear regression would be more appropriate if we want to use some form of regression. The data appear to be fairly well segmented, and I expect that a logistic regression classifier will have a lot of success distinguishing between the two classes. There are several observations that do not conform to the general location of the other observations in their class, so I do not expect a near-perfect classification rate. That said, there does appear to be a general trend to the data, and enough observations, that a logistic regression would be a good choice for picking this decision boundary.
b.
#Describe the data, check for missing values
print(lrdata.isnull().values.any())
print(lrdata.describe())
False
x1 x2 y
count 200.000000 200.000000 200.000000
mean 0.151376 -0.385426 0.485000
std 1.411722 1.217490 0.501029
min -3.210005 -3.193456 0.000000
25% -0.912029 -1.341047 0.000000
50% 0.112286 -0.479684 0.000000
75% 1.174400 0.495114 1.000000
max 3.867647 3.103541 1.000000
Looking at the data, there are no missing values, but there is a modest difference in scale: $x_1$ and $x_2$ have somewhat different means and standard deviations. To deal with this, I would standardize both columns by subtracting the mean and dividing by the standard deviation.
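A minimal sketch of leakage-free standardization, assuming the X_train/X_test arrays produced by the split above (the statistics are computed on the training split only and reused on the test split):
# Standardize using training-set statistics only, then apply
# the same transform to the test set (avoids test-set leakage).
mu = X_train.mean(axis=0)
sd = X_train.std(axis=0)
X_train_std = (X_train - mu) / sd
X_test_std = (X_test - mu) / sd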
Given that our data consists of two features, our logistic regression problem will be applied to a two-dimensional feature space. Recall that our logistic regression model is:
$$f(\mathbf{x}_i,\mathbf{w})=\sigma(\mathbf{w}^{\top} \mathbf{x}_i)$$where the sigmoid function is defined as $\sigma(x) = \dfrac{e^x}{1+e^{x}}= \dfrac{1}{1+e^{-x}}$. Also, since this is a two-dimensional problem, we define $\mathbf{w}^{\top} \mathbf{x}_i = w_0 x_{i,0} + w_1 x_{i,1} + w_2 x_{i,2}$ and here, $\mathbf{x}_i=[x_{i,0}, x_{i,1}, x_{i,2}]^{\top}$, and $x_{i,0} \triangleq 1$
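As a quick numerical sanity check on the sigmoid (a standalone sketch, not part of the assignment code): it maps any real input into $(0,1)$, and $\sigma(0)=0.5$.
import numpy as np
def sigmoid(z):
    # maps reals to (0, 1); sigmoid(0) = 0.5
    return 1.0 / (1.0 + np.exp(-z))
print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # ~[0.0067 0.5 0.9933]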
Remember from class that we interpret our logistic regression classifier output (or confidence score) as the conditional probability that the target variable for a given sample $y_i$ is from class "1", given the observed features, $\mathbf{x}_i$. For one sample, $(y_i, \mathbf{x}_i)$, this is given as:
$$P(Y=1|X=\mathbf{x}_i) = f(\mathbf{x}_i,\mathbf{w})=\sigma(\mathbf{w}^{\top} \mathbf{x}_i)$$In the context of maximizing the likelihood of our parameters given the data, we define this to be the likelihood function $L(\mathbf{w}|y_i,\mathbf{x}_i)$, corresponding to one sample observation from the training dataset.
Aside: the careful reader will recognize this expression looks different from when we talk about the likelihood of our data given the true class label, typically expressed as $P(x|y)$, or the posterior probability of a class label given our data, typically expressed as $P(y|x)$. In the context of training a logistic regression model, the likelihood we are interested in is the likelihood function of our logistic regression parameters, $\mathbf{w}$. It's our goal to use this to choose the parameters to maximize the likelihood function.
(c) What is the likelihood function that corresponds to all the $N$ samples in our training dataset that we will wish to maximize? Unlike the likelihood function written above which gives the likelihood function for a single training data pair $(y_i, \mathbf{x}_i)$, this question asks for the likelihood function for the entire training dataset $\{(y_1, \mathbf{x}_1), (y_2, \mathbf{x}_2), ..., (y_N, \mathbf{x}_N)\}$.
(d) Since a logarithm is a monotonic function, maximizing $f(x)$ is equivalent to maximizing $\ln [f(x)]$. Express the likelihood from the last question as a cost function of the model parameters, $C(\mathbf{w})$; that is, the negative of the logarithm of the likelihood.
(e) Calculate the gradient of the cost function with respect to the model parameters $\nabla_{\mathbf{w}}C(\mathbf{w})$. Express this in terms of the partial derivatives of the cost function with respect to each of the parameters, e.g. $\nabla_{\mathbf{w}}C(\mathbf{w}) = \left[\dfrac{\partial C}{\partial w_0}, \dfrac{\partial C}{\partial w_1}, \dfrac{\partial C}{\partial w_2}\right]$.
To simplify notation, please use $\mathbf{w}^{\top}\mathbf{x}$ instead of writing out $w_0 x_{i,0} + w_1 x_{i,1} + w_2 x_{i,2}$ each time it appears (where $x_{i,0} = 1$ for all $i$). You are also welcome to use $\sigma()$ to represent the sigmoid function. Lastly, this will be a function of the features, $x_{i,j}$ (with the first index in the subscript representing the observation and the second the feature); the targets, $y_i$; and the logistic regression model parameters, $w_j$.
(f) Write out the gradient descent update equation. This should clearly express how to update each weight from one step in gradient descent $w_j^{(k)}$ to the next $w_j^{(k+1)}$. There should be one equation for each logistic regression model parameter (or you can represent it in vectorized form). Assume that $\eta$ represents the learning rate.
c.
Each $y_i$ is a Bernoulli random variable with success probability $\sigma(\mathbf{w}^{\top}\mathbf{x}_i)$, so assuming independent samples, the likelihood for the full training set is the product of the per-sample likelihoods:
$$L(\mathbf{w}|\mathbf{y},\mathbf{X}) = \prod_{i=1}^{N} P(y_i|\mathbf{x}_i,\mathbf{w}) = \prod_{i=1}^{N} \sigma(\mathbf{w}^{\top}\mathbf{x}_i)^{y_{i}}\left(1-\sigma(\mathbf{w}^{\top}\mathbf{x}_i)\right)^{1-y_{i}}$$Where: $$\sigma(\mathbf{w}^{\top}\mathbf{x}_i) = \frac{1}{1+e^{-\mathbf{w}^{\top}\mathbf{x}_i}}$$
d.
$$\ln L(\mathbf{w}|\mathbf{y},\mathbf{X}) = \sum_{i=1}^{N}y_{i}\ln(\hat{y}_{i}) + (1-y_{i})\ln(1-\hat{y}_{i})$$$$C(\mathbf{w}) = -\ln L(\mathbf{w}|\mathbf{y},\mathbf{X}) = -\sum_{i=1}^{N}\left[y_{i}\ln(\hat{y}_{i}) + (1-y_{i})\ln(1-\hat{y}_{i})\right]$$In this case, $\hat{y}_i$ is our predicted value, the output of the sigmoid function: $$\hat{y}_i = \sigma(\mathbf{w}^{\top}\mathbf{x}_i) = \frac{1}{1+e^{-\mathbf{w}^{\top}\mathbf{x}_i}}$$
e. $$C(\mathbf{w}) = -\sum_{i=1}^{N}\left[y_{i}\ln(\hat{y}_{i}) + (1-y_{i})\ln(1-\hat{y}_{i})\right], \qquad \hat{y}_{i} = \sigma(\mathbf{w}^{\top}\mathbf{x}_i)$$
$$\nabla_{\mathbf{w}}C(\mathbf{w}) = \left[\dfrac{\partial C}{\partial w_0}, \dfrac{\partial C}{\partial w_1}, \dfrac{\partial C}{\partial w_2}\right]$$Using $\dfrac{d\sigma(z)}{dz} = \sigma(z)(1-\sigma(z))$ and the chain rule, the terms collapse so that each partial derivative is the prediction error weighted by the corresponding feature:
$$\dfrac{\partial C}{\partial w_0} = \sum_{i=1}^N \left(\sigma(\mathbf{w}^{\top}\mathbf{x}_i) - y_i\right) \cdot 1$$ $$\dfrac{\partial C}{\partial w_1} = \sum_{i=1}^N \left(\sigma(\mathbf{w}^{\top}\mathbf{x}_i) - y_i\right) x_{i,1}$$$$\dfrac{\partial C}{\partial w_2} = \sum_{i=1}^N \left(\sigma(\mathbf{w}^{\top}\mathbf{x}_i) - y_i\right) x_{i,2}$$f.
$$w_j^{(k+1)} = w_j^{(k)} - \eta \dfrac{\partial C}{\partial w_{j}}$$$$w_j^{(k+1)} = w_j^{(k)} - \eta \sum_{i=1}^N \left(\sigma(\mathbf{w}^{(k)\top}\mathbf{x}_i) - y_i\right) x_{i,j}$$where $j \in \{0, 1, 2\}$, one update per model parameter (the bias plus one weight per feature), and the minus sign moves the weights downhill on the cost surface.
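Before coding the full model, the analytic gradient can be sanity-checked against central finite differences of the cost on a tiny random dataset. This is a sketch with hypothetical helper names (sig, cost), using the average cost as in the class implemented below:
import numpy as np
rng = np.random.default_rng(0)
X = np.hstack([np.ones((5, 1)), rng.normal(size=(5, 2))])  # bias column + 2 features
y = rng.integers(0, 2, size=5).astype(float)
w = rng.normal(size=3)
sig = lambda z: 1.0 / (1.0 + np.exp(-z))
cost = lambda w: -np.mean(y * np.log(sig(X @ w)) + (1 - y) * np.log(1 - sig(X @ w)))
grad = (X.T @ (sig(X @ w) - y)) / len(y)  # analytic gradient of the average cost
eps = 1e-6
num_grad = np.array([(cost(w + eps * e) - cost(w - eps * e)) / (2 * eps) for e in np.eye(3)])
print(np.max(np.abs(grad - num_grad)))  # should be ~1e-9 or smaller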
(g) Implement your logistic regression model.
Your implementation should include the following methods:
- sigmoid, which calculates the sigmoid function
- cost, which computes the cost function $C(\mathbf{w})$ for a given dataset and corresponding class labels. This should be the average cost (make sure your total cost is divided by the number of samples in the dataset).
- gradient_descent, to run gradient descent on your training data. We'll refer to this as "batch" gradient descent since it takes into account the gradient based on all our data at each iteration of the algorithm. In doing this we'll need to make some assumptions about the weight initialization, the learning rate, and the stopping criteria (see the fit parameters below).
- fit, which fits the model to the data (i.e. sets the model parameters to minimize cost) using your gradient_descent method
- predict_proba, which predicts confidence scores (that can be thresholded into the predictions of the predict method)
- predict, which makes predictions based on the trained model, selecting the most probable class given the data, that is, the class that yields the larger $P(y|\mathbf{x})$
- learning_curve, which produces the cost function values that correspond to each step from a previously run gradient descent operation
- prepare_x, which appends a column of ones as the first feature of the dataset $\mathbf{X}$ to account for the bias term ($x_{i,0}=1$)

This structure is strongly encouraged; however, you're welcome to adjust this to your needs (adding helper methods, modifying parameters, etc.).
import numpy as np
import pandas as pd

# Logistic regression class
class Logistic_regression:
    # Class constructor
    def __init__(self):
        self.w = None        # logistic regression weights
        self.saved_w = []    # Since this is a small problem, we can save the weights
                             # at each iteration of gradient descent to build our
                             # learning curves
        self.cost_list = []  # cost at each iteration of the last fit

    def sigmoid(self, X, w):
        # returns the value of the sigmoid
        return 1 / (1 + np.exp(-np.dot(X, w.T)))

    def cost(self, X, y, w):
        # returns the average cross entropy cost
        y_hat = self.sigmoid(X, w)
        N = X.shape[0]
        cost = (np.dot(y.T, np.log(y_hat)) + np.dot((1 - y).T, np.log(1 - y_hat))) * (-1 / N)
        return cost

    def gradient_descent(self, X, y, lr):
        # performs one batch gradient descent step and returns a scalar:
        # the Euclidean norm of the change in the weights during the step
        N = X.shape[0]
        y_hat = self.sigmoid(X, self.w)
        # per-parameter gradient of the average cost: (1/N) * sum_i (y_hat_i - y_i) * x_ij
        wdelta0 = (1 / N) * (np.sum(np.dot(X[:, 0], y_hat)) - np.sum(np.dot(X[:, 0], y)))
        wdelta1 = (1 / N) * (np.sum(np.dot(X[:, 1], y_hat)) - np.sum(np.dot(X[:, 1], y)))
        wdelta2 = (1 / N) * (np.sum(np.dot(X[:, 2], y_hat)) - np.sum(np.dot(X[:, 2], y)))
        # scale by the learning rate and step downhill
        wdelta = [i * lr for i in [wdelta0, wdelta1, wdelta2]]
        self.w = self.w - wdelta
        return np.linalg.norm(wdelta)

    def fit(self, X, y, w_init, lr, delta_thresh=1e-6, max_iter=5000, verbose=False):
        # Note the verbose flag enables you to print out the weights at each iteration
        # (optional - but may help with one of the questions)
        self.w = w_init
        X = self.prepare_x(X)
        count = 0
        while count <= max_iter:
            if verbose:
                print(self.w)
            self.saved_w.append(self.w)
            step_size = self.gradient_descent(X, y, lr)
            self.cost_list.append(self.cost(X, y, self.w))
            # stop early once the weights have effectively stopped moving
            if step_size <= delta_thresh:
                break
            count = count + 1
        # returns nothing

    def predict_proba(self, X):
        # returns the confidence score for each sample
        X = self.prepare_x(X)
        return self.sigmoid(X, self.w)

    # Use the trained model to make binary predictions
    def predict(self, X, thresh=0.5):
        # returns a binary prediction for each sample
        prob = self.predict_proba(X)
        predictions = []
        for value in prob:
            if value > thresh:
                predictions.append(1)
            else:
                predictions.append(0)
        return predictions

    # Stores the learning curves from saved weights from gradient descent
    def learning_curve(self, X, y):
        # returns the value of the cost function from each step in gradient descent
        # from the last model fitting process
        X = self.prepare_x(X)  # cost expects the bias column to be present
        cost_list = []
        for wgt in self.saved_w:
            cost_list.append(self.cost(X, y, wgt))
        return cost_list

    # Appends a column of ones as the first feature to account for the bias term
    def prepare_x(self, X):
        # returns X with a new feature of all ones (a column that is the new column 0)
        return np.concatenate((np.ones((len(X), 1)), X), axis=1)
h.
# Random seed to get consistent starting weights
np.random.seed(123)
# Set random weights and learning rate based on given material
W = np.random.rand(1,3)
lr_list = [10**(-2), 10**(-4), 10**(-6), 10**0]
for lr in lr_list:
    model = Logistic_regression()
    model.fit(X_train, y_train, W, lr)
    predictions = model.predict(X_test)
    plt.plot(model.cost_list, label=f'Learning Curve LR = {lr}')
plt.title('Learning Curves')
plt.xlabel('Step Number')
plt.ylabel('Cost Value')
plt.legend()
#Create, fit, and predict with model
model = Logistic_regression()
model.fit(X_train, y_train, W, 10**0)
predictions = model.predict(X_test)
weights = model.saved_w[:50]
weights
[array([[0.69646919, 0.28613933, 0.22685145]]), array([[ 0.59245704, -0.39126897, -0.15613718]]), array([[ 0.47389284, -0.71272569, -0.28158751]]), array([[ 0.3838197 , -0.91348266, -0.34142249]]), array([[ 0.31251674, -1.06206543, -0.37876474]]), array([[ 0.2542961 , -1.18079755, -0.40556752]]), array([[ 0.20575412, -1.27988801, -0.42655781]]), array([[ 0.16465837, -1.36493536, -0.44397086]]), array([[ 0.12944883, -1.43937497, -0.45898568]]), array([[ 0.09898718, -1.50548248, -0.47227418]]), array([[ 0.07241661, -1.56485133, -0.48424364]]), array([[ 0.04907688, -1.61864536, -0.49515459]]), array([[ 0.02844971, -1.66774319, -0.5051823 ]]), array([[ 0.010122 , -1.71282613, -0.5144504 ]]), array([[-0.00623975, -1.75443419, -0.52305022]]), array([[-0.02090787, -1.79300336, -0.53105211]]), array([[-0.03410688, -1.82889119, -0.5385124 ]]), array([[-0.04602358, -1.86239489, -0.54547769]]), array([[-0.05681464, -1.89376441, -0.55198757]]), array([[-0.06661241, -1.92321214, -0.55807649]]), array([[-0.07552947, -1.95092019, -0.56377488]]), array([[-0.08366219, -1.97704599, -0.56911003]]), array([[-0.09109355, -2.00172662, -0.57410662]]), array([[-0.09789538, -2.02508221, -0.57878718]]), array([[-0.10413022, -2.0472187 , -0.58317236]]), array([[-0.10985272, -2.06823 , -0.58728119]]), array([[-0.11511092, -2.08819976, -0.59113127]]), array([[-0.11994716, -2.10720285, -0.59473891]]), array([[-0.12439899, -2.12530656, -0.59811924]]), array([[-0.12849976, -2.14257159, -0.60128636]]), array([[-0.13227924, -2.15905292, -0.60425336]]), array([[-0.13576408, -2.17480047, -0.60703249]]), array([[-0.13897823, -2.18985975, -0.60963516]]), array([[-0.14194324, -2.20427235, -0.61207202]]), array([[-0.14467862, -2.2180764 , -0.61435304]]), array([[-0.14720199, -2.2313069 , -0.61648754]]), array([[-0.14952941, -2.24399611, -0.61848423]]), array([[-0.15167545, -2.2561738 , -0.62035129]]), array([[-0.15365342, -2.26786752, -0.62209636]]), array([[-0.15547552, -2.27910279, -0.62372663]]), array([[-0.15715288, -2.28990331, -0.62524881]]), array([[-0.15869574, -2.30029114, -0.62666922]]), array([[-0.16011351, -2.31028684, -0.6279938 ]]), array([[-0.16141488, -2.31990957, -0.6292281 ]]), array([[-0.16260783, -2.32917728, -0.63037737]]), array([[-0.16369975, -2.33810676, -0.63144651]]), array([[-0.16469749, -2.34671373, -0.63244017]]), array([[-0.16560737, -2.35501298, -0.63336268]]), array([[-0.16643528, -2.36301839, -0.63421816]]), array([[-0.16718667, -2.37074303, -0.63501046]])]
The learning rate determines how quickly the model adjusts the parameter weights. As shown in the graph above, the larger the learning rate, the faster the cost dropped toward 0. When the learning rate was 1, the cost function dropped close to 0 within the first few steps. This is reinforced by the output of the first 50 weights printed above, where the first handful of steps produce large changes in the weights before rapidly leveling off. Returning to the graph, the learning rate of .01 dropped toward 0 second quickest, followed by .0001; the model with .000001 made almost no progress because the minimum threshold on the step size was reached almost immediately. A larger learning rate may help you arrive at the optimized weights faster, but if the learning rate is too large, you risk overshooting the minimum and never descending to the optimal cost and weights.
For the purposes of this data, I think a learning rate of $10^{-2}$ is best. It quickly minimizes the cost without the overshoot risk that a larger rate carries.
(i) Test the performance of your trained classifier using the K-folds cross validation resampling technique. The scikit-learn package StratifiedKFold may be helpful.
from sklearn.neighbors import KNeighborsClassifier
# Learning rate from above
lr = 10**-2
# Create the LR and KNN models for plotting
LRmodel = Logistic_regression()
LRmodel.fit(X_train, y_train, W, lr)
predictions = LRmodel.predict(X_test)
knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(X_train, y_train)
# Store the models so they can be iterated over later
Model_graph = [LRmodel, knn]
# Bring in the data from the training set formatted for graphing
x1 = X_train[:,0]
x2 = X_train[:,1]
y = y_train
# Graph colors to pass in
from matplotlib.colors import ListedColormap
colors1 = ListedColormap(['cyan', 'black', 'mediumaquamarine'])

def plot(mod, x1, x2, y, step, ax, i):
    colors1 = ListedColormap(['cyan', 'black', 'mediumaquamarine'])
    x1_min, x1_max = x1.min() - 1, x1.max() + 1
    x2_min, x2_max = x2.min() - 1, x2.max() + 1
    # Grid of points at which to evaluate each classifier (step sets the resolution)
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, step),
                           np.arange(x2_min, x2_max, step))
    # Model predictions; convert to an array so they can be reshaped
    Z = np.array(mod.predict(np.c_[xx1.ravel(), xx2.ravel()]))
    Z = Z.reshape(xx1.shape)
    # Plot decision boundary and predictions on the graph
    ax.contourf(xx1, xx2, Z, cmap=colors1)
    # Add plot formatting (smaller fonts so the two panels fit side by side)
    ax.scatter(x1, x2, c=y)
    ax.set_xlabel('X Coordinate 1', fontsize=10)
    ax.set_ylabel('X Coordinate 2', fontsize=10)
    if i == 0:
        ax.set_title('Logistic Regression Training Set', fontsize=13)
    else:
        ax.set_title('KNN Training Set', fontsize=13)
    ax.legend(*scatter.legend_elements(), loc="lower right", title="Classes")
fig, ax = plt.subplots(1,2, figsize = (10,5))
for index, mod in enumerate(Model_graph):
    plot(mod, x1, x2, y, lr, ax[index], index)
x1 = X_test[:,0]
x2 = X_test[:,1]
y = y_test
def plot(mod, x1, x2, y, step, ax, i):
    colors1 = ListedColormap(['cyan', 'black', 'mediumaquamarine'])
    x1_min, x1_max = x1.min() - 1, x1.max() + 1
    x2_min, x2_max = x2.min() - 1, x2.max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, step),
                           np.arange(x2_min, x2_max, step))
    # Model predictions; convert to an array so they can be reshaped
    Z = np.array(mod.predict(np.c_[xx1.ravel(), xx2.ravel()]))
    Z = Z.reshape(xx1.shape)
    # Plot decision boundary and predictions on the graph
    ax.contourf(xx1, xx2, Z, cmap=colors1)
    ax.scatter(x1, x2, c=y)
    ax.set_xlabel('X Coordinate 1', fontsize=10)
    ax.set_ylabel('X Coordinate 2', fontsize=10)
    if i == 0:
        ax.set_title('Logistic Regression Test Set', fontsize=13)
    else:
        ax.set_title('KNN Test Set', fontsize=13)
    ax.legend(*scatter.legend_elements(), loc="lower right", title="Classes")
fig, ax = plt.subplots(1,2, figsize = (10,5))
for index, mod in enumerate(Model_graph):
    plot(mod, x1, x2, y, lr, ax[index], index)
from sklearn.model_selection import StratifiedKFold
from sklearn import metrics
# ROC Curves
lr = 10**-2
# Create the K fold Validations
kfold = StratifiedKFold(n_splits=10)
kfold.get_n_splits(X_train, y_train)
kflr = Logistic_regression()
#Create Lists to store true positive and false positive rates which make up the ROC curves
knn_pred_list = []
lr_pred_list = []
y_actual = []
for i, (train, test) in enumerate(kfold.split(X_train, y_train)):
    # Split the data into the K-fold samples
    X_trainsample = X_train[train, :]
    X_testsample = X_train[test, :]
    y_trainsample = y_train[train]
    y_testsample = y_train[test]
    kflr = Logistic_regression()
    kflr.fit(X_trainsample, y_trainsample, W, lr)
    predict = kflr.predict_proba(X_testsample)
    lr_pred_list.append(predict)
    # Make KNN predictions
    KFknn = KNeighborsClassifier(n_neighbors=7)
    KFknn = KFknn.fit(X_trainsample, y_trainsample)
    KNNpred = KFknn.predict_proba(X_testsample)
    knn_pred_list.append(KNNpred)
    y_actual.append(y_testsample)
# ROC curves: concatenate the out-of-fold scores and labels across all folds
y_all = np.concatenate(y_actual)
lr_scores = np.concatenate([p.ravel() for p in lr_pred_list])
knn_scores = np.concatenate([p[:, 1] for p in knn_pred_list])
knnfpr, knntpr, knnthreshold = metrics.roc_curve(y_all, knn_scores)
knn_auc = metrics.auc(knnfpr, knntpr)
lrfpr, lrtpr, lrthreshold = metrics.roc_curve(y_all, lr_scores)
lr_auc = metrics.auc(lrfpr, lrtpr)
# ROC Curve plot
plt.figure(figsize = (10,10))
# Plot both models
plt.plot(knnfpr, knntpr, label='K-Fold KNN AUC = %0.2f' % knn_auc)
plt.plot(lrfpr, lrtpr, label='K-Fold Logistic Regression AUC = %0.2f' % lr_auc)
plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='r', label='Random guess')
plt.legend(loc = 'lower right')
plt.title('ROC Curve For Different Classifiers')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.grid('on')
plt.axis('square')
plt.show()
The purpose of doing k-fold validation is to get a more accurate sense of what the test error will be. Often we only have access to the training data, and training error is not a reliable estimate of the test error rate; relying on it invites over- or underfitting. One workaround is to split your training data into training and validation sets, but this limits the amount of data available to train your model. K-fold validation offers a partial solution to this problem: instead of a single split, you divide the data into k folds (in this case 10), train your model on all folds except the one being withheld, and use the withheld fold as the test set. This process is repeated across all folds, which lets you train your model and fine-tune parameters while getting an idea of what the test error could be.
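As a concrete sketch of that procedure using scikit-learn's built-in helper (shown with the KNN model, since the hand-rolled Logistic_regression class does not implement the scikit-learn estimator interface):
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
# Mean and spread of the out-of-fold accuracy estimate the test error rate
scores = cross_val_score(KNeighborsClassifier(n_neighbors=7), X_train, y_train,
                         cv=StratifiedKFold(n_splits=10))
print(scores.mean(), scores.std())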
Based on the aggregated out-of-fold predictions, I would expect the two models to perform similarly on this dataset given its distribution and shape.
ANSWER
An exploration of regularization, imbalanced classes, ROC and PR curves
Load your dataset from the MNIST dataset of handwritten digits, using the code provided below. MNIST has a training set of 60,000 examples, and a test set of 10,000 examples. The digits have been size-normalized and centered in a fixed-size image.
Your goal is to classify whether or not an example digit is a 3. Your binary classifier should predict $y=1$ if the digit is a 3, and $y=0$ otherwise. Create your dataset by transforming your labels into a binary format (3's are class 1, and all other digits are class 0).
(a) Plot 10 examples of each class (i.e. class $y=0$, which are not 3's, and class $y=1$, which are 3's) from the training dataset (you may need to use reshape to convert each 784-pixel row back into a 28-by-28 image).
(b) How many examples are present in each class? Show a histogram of samples by class. What fraction of samples are positive? What issues might this cause?
(c) Using a logistic regression classifier, apply lasso regularization and retrain the model and evaluate its performance over a range of values on the regularization coefficient. You can implement this using the LogisticRegression module and activating the 'l1' penalty; the parameter $C$ is the inverse of the regularization strength. Vary the value of C logarithmically from $10^{-4}$ to $10^4$ (and make your x-axes logarithmic in scale) and evaluate it at 20 different values of C. As you vary the regularization coefficient, plot:
- the number of non-zero model parameters (coef_ gives you access to the model parameters for a trained model)
- the cross entropy loss (the log_loss function)
- the F1-score (the f1_score function may be useful)
(d) Train and test a (1) logistic regression classifier with minimal regularization (using the Scikit Learn package, set penalty='l1', C=1e100 to approximate this), (2) a logistic regression classifier with the best value of the regularization parameter from the last section, (3) a Linear Discriminant Analysis (LDA) Classifier, and (4) a Random Forest (RF) classifier (using default parameters for the LDA and RF classifiers).
# Load the MNIST Data
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import pickle
# Set this to True to download the data for the first time and False after the first time
# so that you just load the data locally instead
download_data = True
if download_data:
    # Load data from https://www.openml.org/d/554
    X, y = fetch_openml('mnist_784', return_X_y=True, as_frame=False)
    # Adjust the labels to be '1' if y==3, and '0' otherwise
    y[y != '3'] = 0
    y[y == '3'] = 1
    y = y.astype('int')
    # Divide the data into a training and test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/7, random_state=88)
    file = open('tmpdata', 'wb')
    pickle.dump((X_train, X_test, y_train, y_test), file)
    file.close()
else:
    file = open('tmpdata', 'rb')
    X_train, X_test, y_train, y_test = pickle.load(file)
    file.close()
ANSWER
a.
# Indices of the positive class (3's) and negative class (non-3's)
indices = [i for i, x in enumerate(y_train) if x == 1]
xsamplelist = indices[:10]
sample_list = []
for index in xsamplelist:
    sample = X_train[index, :]
    sample = np.reshape(sample, (28, 28))
    sample_list.append(sample)
indices0 = [i for i, x in enumerate(y_train) if x == 0]
xsamplelist0 = indices0[:10]
sample_list0 = []
for index in xsamplelist0:
    sample0 = X_train[index, :]
    sample0 = np.reshape(sample0, (28, 28))
    sample_list0.append(sample0)
# Sample of 3's
plt.figure(figsize=(25, 25))
for i in range(0, len(sample_list)):
    plt.subplot(1, 10, i + 1)
    plt.imshow(sample_list[i], cmap='gray')
    plt.title("Class 1")
plt.show()
# Sample of non-3's
plt.figure(figsize=(25, 25))
for i in range(0, len(sample_list0)):
    plt.subplot(1, 10, i + 1)
    plt.imshow(sample_list0[i], cmap='gray')
    plt.title("Class 0")
plt.show()
b.
print(f'There are {len(indices)} examples of threes, and {len(indices0)} examples of numbers that are not threes in this dataset')
# Bar chart of the number of samples in each class
plt.bar(['Class 0 (not 3)', 'Class 1 (3)'], [len(indices0), len(indices)], color=['Blue', 'Green'])
plt.title('Histogram of Class Distribution')
plt.xlabel('Class')
plt.ylabel('Number of Observations')
There are 6129 examples of threes, and 53871 examples of numbers that are not threes in this dataset
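From these counts, only 6129/60000 ≈ 10.2% of the training samples are positive. This imbalance means a classifier that always predicts class 0 would already be about 90% accurate, so plain accuracy is a weak measure here; metrics like F1, precision, and recall are more informative.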
c.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.metrics import f1_score
from sklearn import metrics
# Initialize lists for all the scores
cost_list = []
coef_list = []
f1_list = []
auc_list = []
C = np.logspace(-4,4,20)
# Loop through the values in C to run the model at all 20 points of the logspace
for c in C:
    model = LogisticRegression(penalty='l1', C=c, solver='liblinear')
    model.fit(X_train, y_train)
    # Make hard predictions and predicted probabilities with the model
    predictions = model.predict(X_test)
    probs = model.predict_proba(X_test)[:, 1]
    # Cross entropy cost is computed on the probabilities, not the hard labels
    f_cost = log_loss(y_test, probs)
    cost_list.append(f_cost)
    # Store number of nonzero parameters
    coefs = np.sum(model.coef_ != 0)
    coef_list.append(coefs)
    # Calculate and store the F1 score (computed on the hard predictions)
    f1 = f1_score(y_test, predictions, pos_label=1)
    f1_list.append(f1)
    # Get AUC info (computed on the probabilities)
    auc = metrics.roc_auc_score(y_test, probs)
    auc_list.append(auc)
plt.figure(figsize=(10, 10))
# Coefficient graph
plt.subplot(2, 2, 1)
plt.plot(C, coef_list)
plt.xscale('log')
plt.title('Non-zero Coefficients per Model')
plt.xlabel('Log of (C) parameter')
plt.ylabel('Number of non-zero Coefficients')
# Cost graph
plt.subplot(2, 2, 2)
plt.plot(C, cost_list)
plt.xscale('log')
plt.title('Cost Function Score')
plt.xlabel('Log of (C) parameter')
plt.ylabel('Cross Entropy Cost')
# F1 score graph
plt.subplot(2, 2, 3)
plt.plot(C, f1_list)
plt.xscale('log')
plt.title('F1 Score')
plt.xlabel('Log of (C) parameter')
plt.ylabel('F1 Score')
# AUC graph
plt.subplot(2, 2, 4)
plt.plot(C, auc_list)
plt.xscale('log')
plt.title('AUC')
plt.xlabel('Log of (C) parameter')
plt.ylabel('Area Under the Curve')
plt.tight_layout()
plt.show()
d.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
import random
random.seed(20)
###(1) logistic regression classifier with minimal regularization (using the Scikit Learn package, set penalty='l1', C=1e100 to approximate this)
model1 = LogisticRegression(penalty='l1', C=1e100, solver = 'liblinear')
model1.fit(X_train, y_train)
predictions = model1.predict_proba(X_test)[:,1]
# AUC Curve
mod1fpr, mod1tpr, thresholds1 = metrics.roc_curve(y_test, predictions, pos_label=1)
auc_mod1 = metrics.roc_auc_score(y_test, predictions)
###(2) a logistic regression classifier with the best value of the regularization parameter from the last section,
model2 = LogisticRegression(penalty='l1', C=10**(-2), solver = 'liblinear')
model2.fit(X_train, y_train)
predictions2 = model2.predict_proba(X_test)[:,1]
#AUC Curve
mod2fpr, mod2tpr, thresholds2 = metrics.roc_curve(y_test, predictions2, pos_label=1)
auc_mod2 = metrics.roc_auc_score(y_test, predictions2)
###(3) a Linear Discriminant Analysis (LDA) Classifier
LDA = LinearDiscriminantAnalysis()
LDA.fit(X_train, y_train)
LDApredictions = LDA.predict_proba(X_test)[:,1]
#AUC Curve
ldafpr, ldatpr, thresholds3 = metrics.roc_curve(y_test, LDApredictions, pos_label=1)
auc_lda = metrics.roc_auc_score(y_test, LDApredictions)
###(4) a Random Forest (RF) classifier (using default parameters for the LDA and RF classifiers)
clf = RandomForestClassifier(max_depth=2, random_state=0)  # note: max_depth=2 limits tree depth, rather than the library default
clf.fit(X_train, y_train)
CLFpredictions = clf.predict_proba(X_test)[:,1]
#AUC Curve
rfcfpr, rfctpr, thresholds4 = metrics.roc_curve(y_test, CLFpredictions, pos_label=1)
auc_rfc = metrics.roc_auc_score(y_test, CLFpredictions)
# ROC curve plot for all four models
plt.figure(figsize = (10,10))
# Plot all models
plt.plot(mod1fpr, mod1tpr, label = 'Model 1 Logistic Regression AUC = %0.2f' % auc_mod1)
plt.plot(mod2fpr, mod2tpr, label = 'Model 2 Logistic Regression AUC = %0.2f' % auc_mod2)
plt.plot(ldafpr, ldatpr, label = 'LDA AUC = %0.2f' % auc_lda)
plt.plot(rfcfpr, rfctpr, label = 'Random Forest Classifier AUC = %0.2f' % auc_rfc)
plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='r', label='Random guess')
plt.legend(loc = 'lower right')
plt.title('ROC Curve For Different Classifiers')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.grid('on')
plt.axis('square')
plt.show()
from sklearn.metrics import precision_recall_curve
# Precision Recall Log Reg Model1
mod1prec, mod1recall, mod1thresh = precision_recall_curve(y_test, predictions, pos_label =1)
# Precision Recall Log Reg Model2
mod2prec, mod2recall, mod2thresh = precision_recall_curve(y_test, predictions2, pos_label =1)
# Precision Recall LDA
ldaprec, ldarecall, ldathresh = precision_recall_curve(y_test, LDApredictions, pos_label =1)
# Precision Recall Random Forest
rfcprec, rfcrecall, rfcthresh = precision_recall_curve(y_test, CLFpredictions, pos_label =1)
# Random-guess baseline for a PR curve: the fraction of positive samples in the test set
rand = (sum(y_test)/len(y_test))
### Plot the Precision Recall curves
plt.figure(figsize=(15,15))
plt.plot(mod1recall, mod1prec, label = 'Logistic Regression Model 1')
plt.plot(mod2recall, mod2prec, label = 'Logistic Regression Model 2')
plt.plot(ldarecall, ldaprec, label = 'LDA Model')
plt.plot(rfcrecall, rfcprec, label = 'Random Forest Classifier')
plt.plot([0,1],[rand,rand],linestyle="--", linewidth=1, color='red', label="random")
plt.ylabel("Precision")
plt.xlabel("Recall")
plt.title('Precision-Recall Curve for Multiple Classifiers')
plt.legend(loc = 'lower right')
plt.show()
In this section I created four different models to classify the data. The first model was a lasso logistic regression with minimal regularization. The second model was also a logistic regression, but this time using the best regularization term from the sweep over C, which I judged to be $10^{-2}$. The third model was a Linear Discriminant Analysis classifier, and the final model was a Random Forest classifier. After training these models and using them to generate predictions, I plotted their ROC and Precision-Recall curves, which can be seen above.
AUC stands for Area Under the Curve; here the curve is the ROC curve, which plots the true positive rate on the y-axis against the false positive rate on the x-axis and serves as an indicator of a model's accuracy. The best models have a true positive rate close to 1 and a false positive rate close to 0, so the further toward the upper left the curve sits, the better the model. The region beneath and to the right of the curve is "under the curve," and thus the larger the area under the curve, the more accurate the model is deemed to be.
A Precision-Recall curve shares the same general concept as the ROC curve, but with precision on the y-axis and recall on the x-axis. Precision is the number of true positives divided by the sum of true positives and false positives. Recall is the number of true positives divided by the sum of true positives and false negatives. We want to maximize both precision and recall, so the best PR curves sit toward the upper right of the graph.
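A tiny worked example to ground the definitions (hypothetical labels, not data from this assignment):
from sklearn.metrics import precision_score, recall_score
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
# TP = 2, FP = 1, FN = 1 -> precision = 2/3, recall = 2/3
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred))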
For the AUC scores, the two Logistic Regression classifiers performed almost identically, with both achieving an Area Under the Curve score of 0.98. The Linear Discriminant Analysis classifier was not far behind with an AUC of 0.97. The Random Forest Classifier performed the worst with an AUC of 0.94.
We saw the same general trends reflected in the PR-curve, with the two logistic regression models performing best with LDA not far behind, and the random forest classifier performing worst. In this graph, the second logistic regression model with a regularization of $10^{-2}$ seems to have performed slightly better than the default model with minimal regularization.
Because of its slightly better performance on the PR curve, I believe the second lasso logistic regression model was the best of these options, and I would choose to apply it to an unseen dataset because it appears to perform best on test data. However, the difference in performance from the logistic regression model with minimal regularization is very slight.
The phrase "Bayes' decision rule" is often used to describe a classifier decision rule that minimizes misclassification rate (equally penalizing false positives and false negatives) for a given problem. In this exercise you will first determine the Bayes' decision rule for a binary classification problem where you know the likelihood of data from each class.
This binary classification problem has two target classes with data distributed as exponential random variables:
$$P(x|C_i) = \lambda_i e^{-\lambda_i x}$$Where $C_i$ represents the class from which the sample is drawn (0 or 1). This is known as the class-conditional likelihood, not surprisingly because it is the likelihood of the data conditioned on knowing what class it came from. This represents two separate density functions: one for the case when the class is 0 ($P(x|C_0)$) and one for when the class is 1 ($P(x|C_1)$). Assume that we know that $\lambda_0 = 5$ and $\lambda_1 = 1$ to fully specify those distributions.
(a) Plot the probability of each class conditional distribution (e.g. likelihood function), $P(x|C_0)$ and $P(x|C_1)$, on the same plot in the domain $x \in [0,2]$. You can use scipy's expon module for this. Note that the scale parameter for this module is defined as $1/\lambda$.
(b) Assuming the prior class distributions are $P(C_0)=P(C_1)=0.5$, determine the Bayes' decision rule using the information above. Remember that the Bayes Decision rule can be defined using the posterior distributions of the data; when $P(Y=1|x)>P(Y=0|x)$, predict Class 1, otherwise predict Class 0. In that way, you will assign the most probable class to the data based on the value of $x$. The decision rule will then be of the form:
If $x > x^*$, then predict Class 1, otherwise predict Class 0
Determine the value $x^*$ that minimizes misclassification (equally penalizing false positives and false negatives, and no penalty/reward for correct detections). Show your work in deriving this value.
(c) How does your answer in (b) relate to the plot you made in (a)?
(d) What if instead, $P(C_1)=2P(C_0)$; what would the new value of $x^*$ be in this case? Before computing the value, think through how you would expect it to change, then see if the math supports your conclusion.
(e) Load the test data in the file A3_Q3_test.csv, which follows the distributions above. Apply your Bayes decision rule to the data. What is the misclassification rate (error rate, or fraction of misclassified samples) of this decision rule? This should be the best that any algorithm could achieve (on average).
(f) Load the training data in the file A3_Q3_train.csv (which follows the distributions above) and train a logistic regression classifier on the data (using default parameters) from Scikit-Learn's LogisticRegression module. What is your misclassification error for your test dataset? How does this compare with the Bayes' classifier?
(g) What is your decision rule for the logistic regression model you just trained? To compute this, extract the parameters from your fit model (look for the coef_ and intercept_ attributes) and since the classes are balanced, the decision rule will be to classify a sample $x$ as Class 1 when your logistic regression sigmoid is greater than 0.5 (the halfway point from the two extremes of 0 and 1), since we assume $P(C_1|x)=\sigma(w_0 + w_1 x)$ in logistic regression. How does the decision rule from logistic regression compare with the Bayes' classifier?
ANSWER
a.
from scipy.stats import expon
# x in the domain [0, 2]; create the exponential densities for lambda = 1 and lambda = 5
lspace = np.linspace(0, 2)
lambda1 = expon.pdf(lspace, scale = 1)
lambda5 = expon.pdf(lspace, scale = (1/5))
# Plot the two class-conditional likelihoods
plt.plot(lspace, lambda1, label='Class 1 (lambda = 1)')
plt.plot(lspace, lambda5, label='Class 0 (lambda = 5)')
plt.title('Class-Conditional Distributions for Class 0 and Class 1')
plt.xlabel('x')
plt.ylabel('Probability Density')
plt.legend()
plt.show()
b.
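With equal priors, Bayes' rule predicts class 1 when $P(x|C_1)P(C_1) > P(x|C_0)P(C_0)$, and the priors cancel. Setting the class-conditional likelihoods equal at the boundary:
$$e^{-x^*} = 5e^{-5x^*}$$$$-x^* = \ln 5 - 5x^*$$$$4x^* = \ln 5$$$$x^* = \frac{\ln 5}{4} \approx 0.4024$$So the decision rule is: if $x > 0.4024$, predict class 1; otherwise predict class 0.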
c.
The decision boundary marks the place where the more probable class for an observation switches from one class to the other. The graph in part (a) shows the likelihood of an observation under each class as a function of $x$, and the value calculated in part (b) is the $x$ at which the two curves intersect, where the two classes are equally probable. For any $x$ less than .4024, the class with $\lambda = 5$ (class 0) has the higher likelihood of being the correct class; for values greater than .4024, the class with $\lambda = 1$ (class 1) has the higher likelihood.
d.
If $P(C_{1}) = 2P(C_{0})$, the prior probability of class 1 increases, so the region assigned to class 1 should grow and the decision boundary $x^*$ should decrease. Setting the posteriors equal with $P(C_1) = 2/3$ and $P(C_0) = 1/3$:
$$P(x|C_1)P(C_1) = P(x|C_0)P(C_0)$$$$e^{-x}\cdot\tfrac{2}{3} = 5e^{-5x}\cdot\tfrac{1}{3}$$$$2e^{-x} = 5e^{-5x}$$$$\ln 2 - x = \ln 5 - 5x$$$$4x = \ln\tfrac{5}{2}$$$$x^* = \frac{\ln 2.5}{4} \approx 0.229$$The math supports the intuition: doubling the prior on class 1 moves the boundary down, from 0.4024 to 0.229, so that more of the $x$-axis is assigned to class 1.
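A quick numerical check of this boundary (a standalone sketch): the two prior-weighted likelihoods should match at $x^*$.
import numpy as np
x_star = np.log(2.5) / 4
# Both prior-weighted likelihoods evaluate to ~1.591 at the boundary
print(2 * np.exp(-x_star), 5 * np.exp(-5 * x_star))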
e.
# Import the test Data
test = pd.read_csv('./Data/A3_Q3_test.csv')
# Loop through values in the column of x; if larger than the 0.4024 calculated above, classify as 1
classifier = []
for row in test['x']:
    if row > .4024:
        classifier.append(1)
    else:
        classifier.append(0)
# Append the classifications to the data frame
test['classifier'] = classifier
# Count misclassified observations (where y does not equal the classification)
misclass = sum(abs(test['y']-test['classifier']))
N = len(test)
misclass_rate = misclass/N
misclass_rate
0.23395
The misclassification rate of the Bayes classifier calculated above is .23395, or 23.395%.
f.
# Import the train Data
train = pd.read_csv('./Data/A3_Q3_train.csv')
# Split the data and reshape it to work with the function
x_train = train.iloc[:,1].values
y_train = train.iloc[:,2].values
x_test = test.iloc[:,1].values
y_test = test.iloc[:,2].values
x_train = x_train.reshape(-1, 1)
y_train = y_train.reshape(-1, 1)
x_test = x_test.reshape(-1, 1)
#Create, fit and do predictions with Logistic Regression model
logreg = LogisticRegression()
logreg.fit(x_train, y_train.ravel())
predictions = logreg.predict(x_test)
# Count the number of incorrect predictions and divide by total observations
misclass2 = sum(abs(predictions-y_test))
misclass2/N
0.234
The misclassification rate of this model is .234, which is just barely larger than the Bayes classification error I calculated in part (e). The Bayes classifier should hypothetically be the best you can do, so having a misclassification rate so close to it signifies that logistic regression is a good model for this data.
g.
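The threshold follows from setting the sigmoid to 0.5: since $\sigma(z) = 0.5$ exactly when $z = 0$, we have $w_0 + w_1 x^* = 0$, i.e. $x^* = -w_0/w_1$, which is what the expression below computes from the fitted intercept_ and coef_.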
print(f'Intercept: {logreg.intercept_}, Coefficient: {logreg.coef_}, Decision Boundary: {(-logreg.intercept_/logreg.coef_)}')
Intercept: [-1.60348178], Coefficient: [[3.97552165]], Decision Boundary: [[0.40333871]]
# Create ROC scores for the trained logistic regression model
probs = logreg.predict_proba(x_test)[:, 1]
fpr2, tpr2, threshold = metrics.roc_curve(y_test, probs)
auc2 = metrics.roc_auc_score(y_test, probs)
# Plot ROC scores
plt.figure(figsize = (10,10))
plt.plot(fpr2, tpr2, label = 'Logistic Regression AUC = %0.2f' % auc2)
plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='r', label='Random guess')
plt.legend(loc = 'lower right')
plt.title('ROC Curve for the Logistic Regression Classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.grid('on')
plt.axis('square')
plt.show()
The decision boundary for the model is at 0.4033, which is extremely close to the 0.4024 I calculated for the Bayes decision boundary in part (b). This explains why the misclassification rates are virtually identical, and since the Bayes decision boundary is the best we can do, it suggests that logistic regression is a good choice for this data.