Machine Learning Basics

Caleb O'Neel

Learning Objectives:

By successfully completing this assignment you will be able to...

Conceptual Questions

1

[5 points] For each part (a) through (d), indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method. Justify your answer.

(a) The sample size $n$ is extremely large, and the number of predictors $p$ is small.

I would generally expect a flexible statistical learning method to perform better in this scenario. With an extremely large sample size and few predictors, a flexible approach can reduce bias substantially with only a minimal increase in variance, because the large sample size helps smooth the decision boundary.

(b) The number of predictors $p$ is extremely large, and the number of observations $n$ is small.

Small samples are more prone to overfitting. Overfitting is a much larger problem for highly flexible models and will likely lead to very high variance. I would expect a flexible model to perform worse in this scenario, and an inflexible model to perform better.

(c) The relationship between the predictors and response is highly non-linear.

A flexible model would be better in this scenario. The less flexible a model is, the closer to linear it becomes. If the relationship is highly non-linear, we do not want the decision boundary to be linear.

(d) The variance of the error terms, i.e. $\sigma^2 = Var(\epsilon)$, is extremely high.

If there is large variance in the error terms, a flexible model is more at risk of fitting that noise. I would expect a flexible model to perform worse in these conditions.

2

[5 points] For each of the following, (i) explain if each scenario is a classification or regression problem, (ii) indicate whether we are most interested in inference or prediction for that problem, and (iii) provide the sample size $n$ and number of predictors $p$ indicated for each scenario.

ANSWER

(a) We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.

(i) This problem is better suited to regression. The response variable, CEO salary, is continuous rather than categorical. In this case a multiple linear regression would answer the question best.

(ii) Inference - we want to understand the relationship between the independent variables and CEO salary.

(iii) $n$ = 500, $p$ = 3

(b) We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.

(i) Classification would be the better approach in this scenario. Here, we want to classify a product launch into one of two categories: success or failure. A classification model takes all the input variables, generates a prediction from them, and translates that prediction into one of the available classes.

(ii) Prediction

(iii) n = 20, p = 13

(c) We are interested in predicting the % change in the US dollar in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the dollar, the % change in the US market, the % change in the British market, and the % change in the German market.

(i) Regression is the better approach if we want to understand the percent change of a response variable, since we are not classifying an outcome. We want to see how something changes in relation to other variables and predict based on that, not put it into a "category".

(ii) Prediction

(iii) n = 52, p = 3

Practical Questions

3

[10 points] Classification II. The table below provides a training dataset containing six observations ($n=6$), three predictors ($p=3$), and one qualitative response variable.

Table 1. Dataset with $n=6$ observations in $p=3$ dimensions with a categorical response, $y$

| Obs. | $x_1$ | $x_2$ | $x_3$ | $y$ |
|------|-------|-------|-------|------|
| 1 | 0 | 3 | 0 | Red |
| 2 | 2 | 0 | 0 | Red |
| 3 | 0 | 1 | 3 | Red |
| 4 | 0 | 1 | 2 | Blue |
| 5 | -1 | 0 | 1 | Blue |
| 6 | 1 | 1 | 1 | Red |

We want to use this dataset to make a prediction for $y$ when $x_1=x_2=x_3=0$ using $K$-nearest neighbors. You are given some code below to get you started. Note: coding is only required for part (a), for (b)-(d) please provide your reasoning.

(a) Compute the Euclidean distance between each observation and the test point, $x_1=x_2=x_3=0$. Present your answer in a table similar in style to Table 1 with observations 1-6 as the row headers.

| Obs. | $x_1$ | $x_2$ | $x_3$ | $y$ | Distance |
|------|-------|-------|-------|------|----------|
| 1 | 0 | 3 | 0 | Red | 3.00 |
| 2 | 2 | 0 | 0 | Red | 2.00 |
| 3 | 0 | 1 | 3 | Red | 3.16 |
| 4 | 0 | 1 | 2 | Blue | 2.24 |
| 5 | -1 | 0 | 1 | Blue | 1.41 |
| 6 | 1 | 1 | 1 | Red | 1.73 |
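
These distances can be computed with a short script like the following (a minimal sketch using numpy; the starter code from the prompt is not reproduced here):

```python
import numpy as np

# Training observations from Table 1 and their labels
X = np.array([[ 0, 3, 0],
              [ 2, 0, 0],
              [ 0, 1, 3],
              [ 0, 1, 2],
              [-1, 0, 1],
              [ 1, 1, 1]])
y = np.array(["Red", "Red", "Red", "Blue", "Blue", "Red"])

# Test point at the origin
test_point = np.array([0, 0, 0])

# Euclidean distance from each observation to the test point
distances = np.sqrt(((X - test_point) ** 2).sum(axis=1))
for i, (d, label) in enumerate(zip(distances, y), start=1):
    print(f"Obs. {i}: distance = {d:.2f}, y = {label}")
```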

(b) What is our prediction with $K=1$? Why?

$K = 1$ means that we classify based on the single closest neighbor. In this case, observation 5 is the closest (distance 1.41) and is Blue, so we would classify the point (0, 0, 0) as Blue as well.

(c) What is our prediction with $K=3$? Why?

The three closest neighbors to the test point are observations 5, 6, and 2. Observation 5 is Blue, and observations 2 and 6 are Red. Since Red holds the majority, the point would be predicted to be Red using $K = 3$.

(d) If the Bayes decision boundary (the optimal decision boundary) in this problem is highly nonlinear, then would we expect the best value of $K$ to be large or small? Why?

For a highly non-linear relationship, a small $K$ value is generally better. This allows for more flexibility and less bias. The higher the $K$ value, the more linear the decision boundary becomes.

4

[20 points] Classification I: Creating a classification algorithm.

(a) Build a working version of a binary kNN classifier using the skeleton code below.
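
Since the skeleton code is not reproduced here, below is a minimal sketch of what a working binary kNN classifier might look like (the class and method names are my own, chosen to mirror the scikit-learn fit/predict convention):

```python
import numpy as np

class BinaryKNN:
    """A minimal k-nearest-neighbors classifier for binary targets."""

    def __init__(self, k=5):
        self.k = k

    def fit(self, X, y):
        # kNN is a lazy learner: "training" just stores the data
        self.X_train = np.asarray(X, dtype=float)
        self.y_train = np.asarray(y)
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        preds = np.empty(len(X), dtype=self.y_train.dtype)
        for i, x in enumerate(X):
            # Euclidean distance from x to every training point
            dists = np.sqrt(((self.X_train - x) ** 2).sum(axis=1))
            # Indices of the k closest training points
            nearest = np.argsort(dists)[: self.k]
            # Majority vote among the k neighbors' labels
            labels, counts = np.unique(self.y_train[nearest], return_counts=True)
            preds[i] = labels[np.argmax(counts)]
        return preds
```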

(b) Load the datasets to be evaluated here. Each includes training and test features ($\mathbf{X}$) and targets ($\mathbf{y}$) for both a low dimensional ($p = 2$ features/predictors) and a high dimensional ($p = 100$ features/predictors) dataset. For each of these datasets there are $n=100$ observations of each. They can be found in the data subfolder in the assignments folder on GitHub. Each file is labeled similar to A2_X_train_low.csv, which lets you know whether the dataset contains features, $X$, or targets, $y$; training or test data; and low or high dimensions.

(c) Train your classifier on first the low dimensional dataset and then the high dimensional dataset with $k=5$. Evaluate the classification performance on the corresponding test data for each. Calculate the time it takes to make the predictions in each case and the overall accuracy of each set of test data predictions.

(d) Compare your implementation's accuracy and computation time to the scikit learn KNeighborsClassifier class. How do the results and speed compare?

(e) Some supervised learning algorithms are more computationally intensive during training than testing. What are the drawbacks of the prediction process being slow?

ANSWER:

(b)
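
Assuming the files follow the naming pattern given in the prompt and sit in a local data/ subfolder (only A2_X_train_low.csv is named explicitly; the other file names are inferred from that pattern):

```python
import pandas as pd

# File names inferred from the pattern described in the prompt; the data/ path is assumed
X_train_low = pd.read_csv("data/A2_X_train_low.csv").values
y_train_low = pd.read_csv("data/A2_y_train_low.csv").values.ravel()
X_test_low = pd.read_csv("data/A2_X_test_low.csv").values
y_test_low = pd.read_csv("data/A2_y_test_low.csv").values.ravel()

X_train_high = pd.read_csv("data/A2_X_train_high.csv").values
y_train_high = pd.read_csv("data/A2_y_train_high.csv").values.ravel()
X_test_high = pd.read_csv("data/A2_X_test_high.csv").values
y_test_high = pd.read_csv("data/A2_y_test_high.csv").values.ravel()
```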

(c) Train your classifier on first the low dimensional dataset and then the high dimensional dataset with $k=5$. Evaluate the classification performance on the corresponding test data for each. Calculate the time it takes to make the predictions in each case and the overall accuracy of each set of test data predictions.
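
One way to answer this, reusing the BinaryKNN class and the arrays loaded in the sketches above (the evaluate helper is my own, wrapping the timing and accuracy logic):

```python
import time
import numpy as np

def evaluate(model, X_train, y_train, X_test, y_test, name):
    model.fit(X_train, y_train)
    start = time.time()              # time only the prediction step
    preds = model.predict(X_test)
    elapsed = time.time() - start
    accuracy = np.mean(preds == y_test)
    print(f"{name}: accuracy = {accuracy:.3f}, prediction time = {elapsed:.3f} s")
    return accuracy, elapsed

evaluate(BinaryKNN(k=5), X_train_low, y_train_low, X_test_low, y_test_low, "low-dim")
evaluate(BinaryKNN(k=5), X_train_high, y_train_high, X_test_high, y_test_high, "high-dim")
```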

(d) Compare your implementation's accuracy and computation time to the scikit learn KNeighborsClassifier class. How do the results and speed compare?
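
The comparison can reuse the same evaluate helper from the previous sketch, swapping in scikit-learn's implementation:

```python
from sklearn.neighbors import KNeighborsClassifier

# Same evaluation, with scikit-learn's kNN in place of the hand-rolled classifier
evaluate(KNeighborsClassifier(n_neighbors=5),
         X_train_low, y_train_low, X_test_low, y_test_low, "sklearn low-dim")
evaluate(KNeighborsClassifier(n_neighbors=5),
         X_train_high, y_train_high, X_test_high, y_test_high, "sklearn high-dim")
```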

(e) Some supervised learning algorithms are more computationally intensive during training than testing. What are the drawbacks of the prediction process being slow?

If the prediction process is slow, it lessens the model's value. Answers are often required nearly instantaneously, and if a model takes hours to run, it becomes too slow to be useful. Additionally, it is expensive to run a computationally heavy model for long periods.

For example, suppose a day trader used an algorithm that takes a series of variables at the start of the day and predicts how the price of a stock will move over that day. If the model took twelve hours to run, the information would no longer be useful no matter how accurate it was. Timing is a crucial element for most applications of machine learning.

There can be scenarios where it is acceptable for a model to take a long time to make predictions. If the inputs do not change and the information is not immediately needed, prioritizing accuracy at the expense of speed can be beneficial. But in most business settings speed is important.

5

[20 points] Bias-variance tradeoff I: Understanding the tradeoff. This exercise will illustrate the impact of the bias-variance tradeoff on classifier performance by looking at classifier decision boundaries.

ANSWER

(a) Create a synthetic dataset (with both features and targets). Use the make_moons module with the parameter noise=0.35 to generate 1000 random samples.

(b) Scatterplot your random samples with each class in a different color
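
A sketch covering parts (a) and (b), assuming matplotlib and scikit-learn are available (the random_state value is my own choice, added for reproducibility):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons

# (a) 1000 noisy two-class samples
X, y = make_moons(n_samples=1000, noise=0.35, random_state=0)

# (b) scatterplot with one color per class
plt.scatter(X[y == 0, 0], X[y == 0, 1], c="tab:blue", label="class 0", alpha=0.6)
plt.scatter(X[y == 1, 0], X[y == 1, 1], c="tab:orange", label="class 1", alpha=0.6)
plt.legend()
plt.xlabel("$x_1$")
plt.ylabel("$x_2$")
plt.show()
```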

(c) Create 3 different data subsets by selecting 100 of the 1000 data points at random three times. For each of these 100-sample datasets, fit three k-Nearest Neighbor classifiers with: $k = \{1, 25, 50\}$. This will result in 9 combinations (3 datasets, with 3 trained classifiers).
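
Continuing from the X and y generated above, one way to build the 9 combinations (the seed is again arbitrary):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
ks = [1, 25, 50]

# Three random 100-point subsets, each paired with three classifiers
fitted = {}    # (dataset index, k) -> fitted classifier
subsets = []
for i in range(3):
    idx = rng.choice(len(X), size=100, replace=False)
    X_sub, y_sub = X[idx], y[idx]
    subsets.append((X_sub, y_sub))
    for k in ks:
        fitted[(i, k)] = KNeighborsClassifier(n_neighbors=k).fit(X_sub, y_sub)
```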

(d) For each combination of dataset trained classifier, in a 3-by-3 grid, plot the decision boundary (similar in style to Figure 2.15 from Introduction to Statistical Learning). Each column should represent a different value of $k$ and each row should represent a different dataset.
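
A sketch of the 3-by-3 grid, reusing subsets, fitted, and ks from the previous snippet:

```python
import matplotlib.pyplot as plt
import numpy as np

# Grid of points covering the feature space, used to shade the decision regions
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))
grid = np.c_[xx.ravel(), yy.ravel()]

fig, axes = plt.subplots(3, 3, figsize=(12, 12), sharex=True, sharey=True)
for i, (X_sub, y_sub) in enumerate(subsets):    # rows: datasets
    for j, k in enumerate(ks):                  # columns: values of k
        Z = fitted[(i, k)].predict(grid).reshape(xx.shape)
        ax = axes[i, j]
        ax.contourf(xx, yy, Z, alpha=0.3, cmap="coolwarm")
        ax.scatter(X_sub[:, 0], X_sub[:, 1], c=y_sub, cmap="coolwarm", s=15)
        ax.set_title(f"dataset {i + 1}, k = {k}")
plt.show()
```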

(e) What do you notice about the difference between the rows and the columns. Which decision boundaries appear to best separate the two classes of data? Which decision boundaries vary the most as the data change?

The larger $k$ gets, the more linear the decision boundary becomes. This aligns with expectations because a smaller $k$ has more flexibility and conforms more closely to the particular dataset. This flexibility makes it appear to fit the training data the closest, but often leads to excessively high variance on test data.

For each individual sample, the $k = 1$ boundary looks like the closest fit. However, its shape varies greatly from sample to sample, suggesting that it is overfit to the individual samples. Looking at the data in aggregate, I believe $k = 25$ does the best job of fitting the data.

(f) Explain the bias-variance tradeoff using the example of the plots you made in this exercise.

The bias-variance tradeoff is the concept that bias and variance are inversely related: when one increases, the other decreases. However, the increase or decrease is not necessarily proportionate. There is a point where the combined bias and variance is at its lowest, and the best models ideally use the parameters corresponding to that point.

A model is said to have less bias when it is more flexible. In our example, when $k = 1$ there is very little bias. There is, however, an extreme amount of variance. As you can see from the plots where $k = 1$, the boundary is very jagged and branches off in many directions. The true Bayes decision boundary is likely much smoother, meaning such a flexible decision boundary produces large variance.

Conversely, looking at the plots where $k = 50$, the boundary becomes significantly more linear. This linearity leads to less variance, but significantly more bias. It lacks flexibility to the point that it responds very little to trends in the data and produces an almost straight line.

In the plots where $k = 25$, you see a middle-ground approach. There is much more shape and less bias than with $k = 50$, but the boundary remains much smoother than the decision boundary for $k = 1$. This balances bias and variance better than either of the other two values of $k$.

6

[20 points] Bias-variance trade-off II: Quantifying the tradeoff. This exercise will explore the impact of the bias-variance tradeoff on classifier performance by looking at classifier decision boundaries.

Here, the value of $k$ determines how flexible our model is.

ANSWER

(a) Using the function created earlier to generate random samples (using the make_moons function), create a new set of 1000 random samples, and call this dataset your test set and the previously created dataset your training set.

(b) Train a kNN classifier on your training set for $k = 1,2,...500$. Apply each of these trained classifiers to both your training dataset and your test dataset and plot the classification error (fraction of mislabeled datapoints).
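
A sketch of the sweep over $k$, regenerating the question 5 training set and drawing a fresh test set (the seeds are my own choice):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.neighbors import KNeighborsClassifier

# Training set from question 5 and a fresh test set of the same size
X_train, y_train = make_moons(n_samples=1000, noise=0.35, random_state=0)
X_test, y_test = make_moons(n_samples=1000, noise=0.35, random_state=1)

ks = range(1, 501)
train_err, test_err = [], []
for k in ks:
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # classification error = 1 - accuracy
    train_err.append(1 - clf.score(X_train, y_train))
    test_err.append(1 - clf.score(X_test, y_test))

plt.plot(ks, train_err, label="training error")
plt.plot(ks, test_err, label="test error")
plt.xlabel("k")
plt.ylabel("classification error")
plt.legend()
plt.show()
```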

(c) What trend do you see in the results?

The test error rate plummets at the beginning for small values of $k$ and generally gets smaller until around $k = 100$, where it begins to steadily increase. The error rate is at its lowest and fairly level from around $k = 60$ to $k = 100$, and is at its highest when $k = 1$, closely followed by $k = 500$.

The error rate for the training set behaves differently. There is zero error when $k = 1$, because the model is evaluated on the same data it was trained on. More flexible models overfit to the data, and if the exact same data is used for evaluation the model will perform perfectly, even though it is unlikely to be effective on a new set of data.

(d) What values of $k$ represent high bias and which represent high variance?

$k = 1$ represents high variance, and $k = 500$ represents high bias.

(e) What is the optimal value of $k$ and why?

The optimal value of $k$ is 13 for this dataset. It yields the lowest error rate, at 0.105.

(f) In kNN classifiers, the value of k controls the flexibility of the model - what controls the flexibility of other models?

In other models - more specifically, linear regression - the flexibility of a model is determined by its degrees of freedom. The degrees of freedom increase as the number of parameters in the model increases. The parameters in this context are the additional independent variables in the model.

7

[20 points] Linear regression and nonlinear transformations. You're given training and testing data contained in files "A2_Q7_train.csv" and "A2_Q7_test.csv" in the "data" folder for this assignment. Your goal is to develop a regression algorithm from the training data that performs well on the test data.

Hint: Use the scikit learn LinearRegression module.

ANSWER

(a) Create a scatter plot of your training data.

(b) Estimate a linear regression model ($y = a_0 + a_1 x$) for the training data and calculate both the $R^2$ value and mean square error for the fit of that model for the training data. Also provide the equation representing the estimated model (e.g. $y = a_0 + a_1 x$, but with the estimated coefficients inserted). Consider this your baseline model against which you will compare other model options.
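
A sketch of the baseline fit (the feature and target column names x and y are assumptions; the actual CSV headers may differ):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# File names are taken from the prompt; the data/ path and column names are assumed
train = pd.read_csv("data/A2_Q7_train.csv")
X_train, y_train = train[["x"]].values, train["y"].values

model = LinearRegression().fit(X_train, y_train)
y_hat = model.predict(X_train)
print(f"y = {model.intercept_:.3f} + {model.coef_[0]:.3f} x")
print(f"R^2 = {r2_score(y_train, y_hat):.3f}, "
      f"MSE = {mean_squared_error(y_train, y_hat):.3f}")
```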

(c) If features can be nonlinearly transformed, a linear model may incorporate those non-linear feature transformation relationships in the training process. From looking at the scatter plot of the training data, choose a transformation of the predictor variable, $x$, that may make sense for these data. This will be a multiple regression model of the form $y = a_0 + a_1 x_1 + a_2 x_2 + \ldots + a_n x_n$. Here $x_i$ could be any transformation of $x$ - perhaps it's $\frac{1}{x}$, $log(x)$, $sin(x)$, $x^k$ (where $k$ is any power of your choosing). Provide the estimated equation for this multiple regression model (e.g. if you chose your predictors to be $x_1 = x$ and $x_2 = log(x)$, your model would be of the form $y = a_0 + a_1 x + a_2 log(x)$). Also provide the $R^2$ and mean square error of the fit for the training data.
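
As an illustration only (the right transformation depends on the scatter plot), here is how a polynomial model $y = a_0 + a_1 x + a_2 x^2 + a_3 x^3$ could be fit, reusing X_train and y_train from the baseline sketch:

```python
import numpy as np

# Example transformation: polynomial features x, x^2, x^3 (one of many reasonable choices)
X_poly = np.column_stack([X_train.ravel(),
                          X_train.ravel() ** 2,
                          X_train.ravel() ** 3])

poly_model = LinearRegression().fit(X_poly, y_train)
y_hat_poly = poly_model.predict(X_poly)
a0 = poly_model.intercept_
a1, a2, a3 = poly_model.coef_
print(f"y = {a0:.3f} + {a1:.3f} x + {a2:.3f} x^2 + {a3:.3f} x^3")
print(f"R^2 = {r2_score(y_train, y_hat_poly):.3f}, "
      f"MSE = {mean_squared_error(y_train, y_hat_poly):.3f}")
```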

(d) Using both of the models you created here in (b) and (c), plot the original data (as a scatter plot) and the two curves representing your models (each as a separate line).

(e) Using the models above, apply them to the test data and estimate the $R^2$ and mean square error of the test dataset.
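
Continuing from the two sketches above (model, poly_model, and the metric functions are reused; column names again assumed):

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error, r2_score

# Apply both fitted models to the held-out test data
test = pd.read_csv("data/A2_Q7_test.csv")
X_test, y_test = test[["x"]].values, test["y"].values

X_test_poly = np.column_stack([X_test.ravel(),
                               X_test.ravel() ** 2,
                               X_test.ravel() ** 3])

for name, m, feats in [("baseline", model, X_test),
                       ("transformed", poly_model, X_test_poly)]:
    pred = m.predict(feats)
    print(f"{name}: R^2 = {r2_score(y_test, pred):.3f}, "
          f"MSE = {mean_squared_error(y_test, pred):.3f}")
```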

(f) Which models perform better on the training data, and which on the test data? Why?

The transformed model worked significantly better on the training data, and slightly better on the test data. It performed significantly better on the training data because it had more flexibility. In a regression model, the more degrees of freedom a model has, the more flexible it becomes. Generally, the more flexible a model is, the better it performs on training data.

For the test data, the two models were almost identical, with the transformed model performing slightly better in terms of MSE. The linear model performs poorly because the data does not appear to be linear but the model is. The transformed model is non-linear and thus has a slight advantage in this regard.

(g) Imagine that the test data were significantly different from the training dataset. How might this affect the predictive capability of your model? Why?

If the test data is significantly different from the training data, the model is likely to perform very poorly on it. A model is built to predict data that follows a distribution similar to its training data. If the test data does not resemble the training data, then the model is far less effective. The training and test data must be similar for the model to be effective.