By successfully completing this assignment you will be able to...
scikit-learnscikit-learn supervised learning technique to data and make predictions using it# MAC USERS TAKE NOTE:
# For clearer plots in Jupyter notebooks on macs, run the following line of code:
%config InlineBackend.figure_format = 'retina'
import matplotlib.pyplot as plt
import numpy as np
from scipy.spatial import distance
from scipy import stats
import pandas as pd
from statistics import mode
import sklearn.datasets
from sklearn.neighbors import KNeighborsClassifier
import seaborn as sns
from matplotlib.colors import ListedColormap
import random
import math
from sklearn.linear_model import LinearRegression
import sklearn.metrics as metrics
from sklearn.metrics import classification_report
import time
[5 points] For each part (a) through (d), indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method. Justify your answer.
I would generally expect a flexible statistical learning method to perform better in this scenario. With an extremely large sample size and few predictors, a flexible approach would probably reduce bias significantly with only minimal compromises in variance due to the significant sample size which would smooth the decision boundry.
Small samples are more prone to overfitting. Overfitting is a much larger problem for highly flexible models, and will likely lead to massive issues with variance. I would expect a flexible model to perform worse in this scenario, and an inflexible model to perform better.
A flexible model would be better in this scenario. The less flexible a model is, the more linear it becomes. If the relationship is highly non-linear, we do not want the decision boundry to be linear.
If there is large variance in the observations, a flexible model is more at risk of being influinced by the excess variance. I would expect a flexible model to perform worse in these conditions.
[5 points] For each of the following, (i) explain if each scenario is a classification or regression problem, (ii) indicate whether we are most interested in inference or prediction for that problem, and (iii) provide the sample size $n$ and number of predictors $p$ indicated for each scenario.
ANSWER
(a) We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.
(i) This problem is better geared towards regression. The response variable is continuous and not a category. In this case a multiple linear regression would answer this best.
(ii) Inference - we want to understand a relationship between independent variables and CEO salaty
(iii) $n$ = 500, $p$ = 3
(b) We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.
(i)Classification would be the better aproach in this scenario. Here, we want to classify a product launch into one of two categories: success or failure. A Classification model takes all the input variables, uses the model with in to generate a prediction, and translates this prediction to one of the classes available.
(ii) Prediction
(iii) n = 20, p = 13
(c) We are interesting in predicting the % change in the US dollar in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the dollar, the % change in the US market, the % change in the British market, and the % change in the German market.
If you want to understand the percent change of a response variable, then regression is a better model since we are not classifying an outcome. We want to see how something changes in relation to other variables, and predict based off that, but not put it into a "category".
(ii) Prediction
(iii) n = 52, p = 3
[10 points] Classification II. The table below provides a training dataset containing six observations ($n=6$), three predictors ($p=3$), and one qualitative response variable.
Table 1. Dataset with $n=6$ observations in $p=3$ dimensions with a categorical response, $y$
| Obs. | $x_1$ | $x_2$ | $x_3$ | $y$ |
|---|---|---|---|---|
| 1 | 0 | 3 | 0 | Red |
| 2 | 2 | 0 | 0 | Red |
| 3 | 0 | 1 | 3 | Red |
| 4 | 0 | 1 | 2 | Blue |
| 5 | -1 | 0 | 1 | Blue |
| 6 | 1 | 1 | 1 | Red |
We want to use this dataset to make a prediction for $y$ when $x_1=x_2=x_3=0$ using $K$-nearest neighbors. You are given some code below to get you started. Note: coding is only required for part (a), for (b)-(d) please provide your reasoning.
X = np.array([[ 0, 3, 0],
[ 2, 0, 0],
[ 0, 1, 3],
[ 0, 1, 2],
[-1, 0, 1],
[ 1, 1, 1]])
y = np.array(['r','r','r','b','b','r'])
(a) Compute the Euclidean distance between each observation and the test point, $x_1=x_2=x_3=0$. Present your answer in a table similar in style to Table 1 with observations 1-6 as the row headers.
columns = ['Distance']
df = pd.DataFrame(columns=columns)
origin = [0, 0, 0]
for cords in X:
dst = distance.euclidean(origin, cords)
df = df.append({'Distance': dst}, ignore_index=True)
df.index = np.arange(1, len(df) + 1)
df.index.name = 'Obs.'
| Obs. | $x_1$ | $x_2$ | $x_3$ | $y$ | $Distance$ |
|---|---|---|---|---|---|
| 1 | 0 | 3 | 0 | Red | 3.00 |
| 2 | 2 | 0 | 0 | Red | 2.00 |
| 3 | 0 | 1 | 3 | Red | 3.16 |
| 4 | 0 | 1 | 2 | Blue | 2.24 |
| 5 | -1 | 0 | 1 | Blue | 1.14 |
| 6 | 1 | 1 | 1 | Red | 1.73 |
(b) What is our prediction with $K=1$? Why?
K = 1 means that we will classify based on the single closest neighbor. In this case, observation 5 is the closest and is Blue so we would classify point (0, 0, 0) to be Blue as well.
(c) What is our prediction with $K=3$? Why?
The three closest neighbors to the origin are observations 2, 5 and 6. Observation 5 is blue, and observations 2 and 6 are red. Since there are more reds, the point would be predicted to be red using K = 3.
(d) If the Bayes decision boundary (the optimal decision boundary) in this problem is highly nonlinear, then would we expect the best value of $K$ to be large or small? Why?
For a highly non-linear relationship, a small K value is generally better. This allows for more flexibility and less bias. The higher the K value, the more linear the decision boundry becomes.
ANSWER:
[20 points] Classification I: Creating a classification algorithm.
(a) Build a working version of a binary kNN classifier using the skeleton code below.
(b) Load the datasets to be evaluated here. Each includes training features ($\mathbf{X}$), and test features ($\mathbf{y}$) for both a low dimensional ($p = 2$ features/predictors) and a high dimensional ($p = 100$ features/predictors). For each of these datasets there are $n=100$ observations of each. They can be found in the data subfolder in the assignments folder on github. Each file is labeled similar to A2_X_train_low.csv, which lets you know whether the dataset is of features, $X$, targets, $y$; training or testing; and low or high dimensions.
(c) Train your classifier on first the low dimensional dataset and then the high dimensional dataset with $k=5$. Evaluate the classification performance on the corresponding test data for each. Calculate the time it takes to make the predictions in each case and the overall accuracy of each set of test data predictions.
(d) Compare your implementation's accuracy and computation time to the scikit learn KNeighborsClassifier class. How do the results and speed compare?
(e) Some supervised learning algorithms are more computationally intensive during training than testing. What are the drawbacks of the prediction process being slow?
ANSWER:
# (a) Write your own kNN classifier
class Knn:
# k-Nearest Neighbor class object for classification training and testing
def __init__(self):
self.train= None
self.ytarget = None
def fit(self, x, y):
# Save the training data to properties of this class
self.train = x.values
self.ytarget = y.values
def predict(self, x, k):
y_hat = [] # Variable to store the estimated class label for
# Calculate the distance from each vector in x to the training data
# Convert x values to a format that allows for easier manipulation
x = x.values
for cord in x:
# Calculate the Euclidian distance between the first X coordinate and all cords in training data
dst = np.sqrt(np.sum((cord - self.train)**2, axis = 1))
#Sort the list of distances and take the index numbers of the k-smallest values
dst_sort = np.argsort(dst)[0:k]
# Get the Y classifiers for the corresponding distances, and select the most common classification
target = self.ytarget[dst_sort]
y_target = stats.mode(target)
#Reformat to a single integer, append it to y_hat before taking the next coord of X and repeating process
y_target = y_target[0][0].tolist()
#y_array = np.array(y_target)
y_hat.append(y_target)
# For Accuracy, list format must be changed to be flat
y_hat = [item for sublist in y_hat for item in sublist]
# Return the estimated targets
return y_hat
# Metric of overall classification accuracy
# (a more general function, sklearn.metrics.accuracy_score, is also available)
def accuracy(y,y_hat):
nvalues = len(y)
accuracy = sum(y == y_hat) / nvalues
return accuracy
(b)
# Load the Data Sets
x_test_high = pd.read_csv('Data/A2_X_test_high.csv', header=None)
x_test_low = pd.read_csv('Data/A2_X_test_low.csv', header=None)
x_train_high = pd.read_csv('Data/A2_X_train_high.csv', header=None)
x_train_low = pd.read_csv('Data/A2_X_train_low.csv', header=None)
y_test_high = pd.read_csv('Data/A2_y_test_high.csv', header=None)
y_test_low = pd.read_csv('Data/A2_y_test_low.csv', header=None)
y_train_low = pd.read_csv('Data/A2_y_train_low.csv', header=None)
y_train_high = pd.read_csv('Data/A2_y_train_high.csv', header=None)
(c) Train your classifier on first the low dimensional dataset and then the high dimensional dataset with $k=5$. Evaluate the classification performance on the corresponding test data for each. Calculate the time it takes to make the predictions in each case and the overall accuracy of each set of test data predictions.
# Evaluate the performance of your kNN classifier on a low- and a high-dimensional dataset
# and time the predictions of each
### Low demensional data set
model = Knn()
model.fit(x_train_low, y_train_low)
t0 = time.time()
pred = model.predict(x_test_low, 5)
t1 = time.time()
low_time = round(t1 - t0,4)
#Without converting to values accuracy function struggles
y_test = y_test_low[0].values.ravel()
low_acc = accuracy(y_test, pred)
### High demensional data set
model = Knn()
model.fit(x_train_high, y_train_high)
t2 = time.time()
highpred = model.predict(x_test_high, 5)
t3 = time.time()
high_time = round(t3 - t2,4)
#Without converting to values accuracy function struggles
y_test_high = y_test_high[0].values.ravel()
high_acc = accuracy(y_test_high, highpred)
print(f'The low dimensional model has an accuracy of {low_acc} and takes {low_time} to run')
print(f'The high dimensional model has an accuracy of {high_acc} and takes {high_time} to run')
The low dimensional model has an accuracy of 0.925 and takes 0.2073 to run The high dimensional model has an accuracy of 0.993 and takes 0.3624 to run
(d) Compare your implementation's accuracy and computation time to the scikit learn KNeighborsClassifier class. How do the results and speed compare?
# Time SKLearn model Low dim
skm = KNeighborsClassifier(n_neighbors=5)
skm.fit(x_train_low.values, y_train_low.values.ravel())
t4 = time.time()
skm_pred = skm.predict(x_test_low.values)
t5 = time.time()
skm_time = round(t5 - t4,4)
y_test = y_test_low[0].values.ravel()
lowskm_acc = accuracy(y_test, skm_pred)
# Time SKLearn model high dim
skm = KNeighborsClassifier(n_neighbors=5)
skm.fit(x_train_high.values, y_train_high.values.ravel())
t4 = time.time()
highskm_pred = skm.predict(x_test_high.values)
t5 = time.time()
highskm_time = round(t5 - t4,4)
y_test = y_test_low[0].values.ravel()
highskm_acc = accuracy(y_test, skm_pred)
print(f'The low dimensional model has an accuracy of {lowskm_acc} and takes {skm_time} to run')
print(f'The high dimensional model has an accuracy of {highskm_acc} and takes {highskm_time} to run')
The low dimensional model has an accuracy of 0.925 and takes 0.0281 to run The high dimensional model has an accuracy of 0.925 and takes 0.1857 to run
(e) Some supervised learning algorithms are more computationally intensive during training than testing. What are the drawbacks of the prediction process being slow?
If a prediction process is slow, it lessens the model's value. Answers are often required instantaneously and if a model takes hours to run, it becomes too slow to be useful. Additionally it is expensive to run a computationally heavy model for long periods.
For example, if a day trader were to use an algorithm to take a series of variables at the start of a day and predict how the price of a stock will move in a given day and it took twelve hours to run the model, the information would no longer be useful no matter how accurate. Timing is a crucial element for most applications of Machine Learning.
There can be scenario's where it is acceptable for a machine to run a long time to make predictions. If input do not change and the information is not immidiately needed, prioritizing accuracy at the expense of speed can be beneficial. But in most business settings speed is important.
[20 points] Bias-variance tradeoff I: Understanding the tradeoff. This exercise will illustrate the impact of the bias-variance tradeoff on classifier performance by looking at classifier decision boundaries.
ANSWER
(a) Create a synthetic dataset (with both features and targets). Use the make_moons module with the parameter noise=0.35 to generate 1000 random samples.
x, y = sklearn.datasets.make_moons(n_samples = 1000, noise = 0.35)
(b) Scatterplot your random samples with each class in a different color
# create a scatterplot with the two columns of coordinates of x as the X and Y axis, amd the Y as the classifier
fig, ax = plt.subplots()
scatter = plt.scatter(x[:,0],x[:,1], c = y, alpha = .5)
#Create data labels
plt.title('Random Distribution by Classifier')
plt.xlabel('X1')
plt.ylabel('X2')
legend = ax.legend(*scatter.legend_elements(),
loc="lower left", title="Classes")
ax.add_artist(legend)
<matplotlib.legend.Legend at 0x11d04b490>
(c) Create 3 different data subsets by selecting 100 of the 1000 data points at random three times. For each of these 100-sample datasets, fit three k-Nearest Neighbor classifiers with: $k = \{1, 25, 50\}$. This will result in 9 combinations (3 datasets, with 3 trained classifiers).
# Put X and Y values into data Frame so they can be sampled
xdf = pd.DataFrame(columns = ['x1', 'x2'], data = x)
ydf = pd.DataFrame(columns = ['y'], data = y)
df = xdf.join(ydf, how='outer')
# Create three samples of 100 from the dataframe
sample1 = df.sample(100)
sample2 = df.sample(100)
sample3 = df.sample(100)
# Convert them back to arrays for KNN interpretation
samp1x = sample1[['x1', 'x2']].values
samp1y = sample1[['y']].values
samp2x = sample2[['x1', 'x2']].values
samp2y = sample2[['y']].values
samp3x = sample3[['x1', 'x2']].values
samp3y = sample3[['y']].values
# Create a loop to go through each sample and KNN combo and create a model
samplelist = [sample1, sample2, sample3]
knn_num = [1, 25, 50]
model_list = []
for data in samplelist:
data = data.values
x = data[:,0:2]
y = data[:,2]
for k in knn_num:
model = KNeighborsClassifier(n_neighbors=k)
mod = model.fit(x,y)
model_list.append(mod)
print(model_list)
[KNeighborsClassifier(n_neighbors=1), KNeighborsClassifier(n_neighbors=25), KNeighborsClassifier(n_neighbors=50), KNeighborsClassifier(n_neighbors=1), KNeighborsClassifier(n_neighbors=25), KNeighborsClassifier(n_neighbors=50), KNeighborsClassifier(n_neighbors=1), KNeighborsClassifier(n_neighbors=25), KNeighborsClassifier(n_neighbors=50)]
(d) For each combination of dataset trained classifier, in a 3-by-3 grid, plot the decision boundary (similar in style to Figure 2.15 from Introduction to Statistical Learning). Each column should represent a different value of $k$ and each row should represent a different dataset.
## Create Stepsize and colormap to be used later in the the decision boundry graphing
step_size = 0.02
colors1 = ListedColormap(['cyan', 'black', 'mediumaquamarine'])
# Creating a plot function to loop through to graph each subplot
def plot(df, k, ax):
#Bring in data from respective sampel and format it
data = df.values
x = data[:,0:2]
y = data[:,2]
x1 = data[:,0]
x2 = data[:,1]
y = data[:,2]
#Get the information necessary to plot the decision boundry
x1_min, x1_max = x1.min() - 1, x1.max() + 1
x2_min, x2_max = x1.min() - 1, x2.max() + 1
xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, step_size),
np.arange(x2_min, x2_max, step_size))
#create model based on K-values
model = KNeighborsClassifier(n_neighbors=k)
model.fit(x,y)
#Model Predictions
Z = model.predict(np.c_[xx1.ravel(), xx2.ravel()])
Z = Z.reshape(xx1.shape)
# Plot decision boundry and predictions on the graph
ax.contourf(xx1, xx2, Z, cmap=colors1)
# Add plot formatting, I had to make the font size smaller in order to keep it from overlapping with the ones above and below.
ax.scatter(x1,x2, c = y)
ax.set_xlabel('X Coordinate 1', fontsize = 10)
ax.set_ylabel('X Coordinate 2', fontsize = 10)
ax.set_title(f'Scatter Plot When K ={k}', fontsize = 13)
ax.legend(*scatter.legend_elements(),
loc="lower left", title="Classes")
#ax.legend([0,1], loc = 'upper right', fontsize = 10)
# Create a subplot
fig, ax = plt.subplots(3, 3, figsize = (15,15))
for ind, data in enumerate(samplelist):
for ind2, k in enumerate(knn_num):
plot(data, k, ax[ind, ind2])
(e) What do you notice about the difference between the rows and the columns. Which decision boundaries appear to best separate the two classes of data? Which decision boundaries vary the most as the data change?
The larger that K gets, the more linear the decision boundry becomes. This aligns with expectations because a smaller k has more flexibility, and conforms closer to the particular dataset. This flexibility makes it appear to fit the training data the closest, but often leads to issues with excessively high variance on test data.
For each individual sample, the K = 1 looks like the closest fit. However, the shape of the K = 1 in each sample is highly variable, suggesting that it is overfit to the individual samples. When looking at the data in aggregate I beleive the K = 25 does the best job of fitting to the data.
(f) Explain the bias-variance tradeoff using the example of the plots you made in this exercise.
The bias variance tradeoff the concept that bias and variance are inversely correlated, when one increases the other decreases. However, the increase or decrease is not necessaily porportionate. The is a point when the combined bias and variance is at its lowest, and the best models idealy want to be set to the parameters corresponding to that point.
A model is said to have less bias when it is more flexible. In our example here, when K = 1 there is very little bias. There is, however, an extreme amount of variance. As you can see from the graphs where k = 1 the line is very jagged and branches off in many directions. The true bayes decision boundry is likely much smoother, meaning there will be a large variace created by having such a flexible decision boundry.
Conversely, looking at the graphs where K is equal to 50, the line becomes significantly more linear. This linearity leads to less variance in the error terms, but has significantly more bias. It lacks flexibility to the point it responds very little to data trends and only provices an almost straight line.
In the graphs where k =25, you see a middle ground approach. There is much more shape and less bias to the line than where k = 50,l but it remains much smoother than the decision boundry for k = 1. This balances bias and variance more than the
[20 points] Bias-variance trade-off II: Quantifying the tradeoff. This exercise will explore the impact of the bias-variance tradeoff on classifier performance by looking at classifier decision boundaries.
Here, the value of $k$ determines how flexible our model is.
ANSWER
(a) Using the function created earlier to generate random samples (using the make_moons function), create a new set of 1000 random samples, and call this dataset your test set and the previously created dataset your training set.
# Bring in dataframe from problem 5 for training data
training = df
x_train = xdf[['x1', 'x2']].values
y_train = ydf[['y']].values
# Create test data
x_test, y_test = sklearn.datasets.make_moons(n_samples = 1000, noise = 0.35, shuffle = True)
(b) Train a kNN classifier on your training set for $k = 1,2,...500$. Apply each of these trained classifiers to both your training dataset and your test dataset and plot the classification error (fraction of mislabeled datapoints).
# Create a dataframe to append the knn value and error rate for each step
accuracydf = pd.DataFrame(columns = ['KNN', '%Error'])
#step through each k value 1 through 500 and fit the KNN model, make predictions, and compare to the true answers
for i in range(1,501):
model = KNeighborsClassifier(n_neighbors = i)
model.fit(x_train, y_train.ravel())
predictions = model.predict(x_test)
errors = np.abs(predictions - y_test)
error_rate = sum(errors)/len(y_test)
accuracydf = accuracydf.append({'KNN':i, '%Error':error_rate}, ignore_index=True)
#This loop does the same thing but on the test data.
testdf = pd.DataFrame(columns = ['KNN', '%Error'])
for i in range(1,501):
tmodel = KNeighborsClassifier(n_neighbors = i)
tmodel.fit(x_test, y_test.ravel())
predictions = tmodel.predict(x_test)
errors = np.abs(predictions - y_test)
error_rate = sum(errors)/len(y_test)
testdf = testdf.append({'KNN':i, '%Error':error_rate}, ignore_index=True)
# Create plots for the error rate for both test and training data
plt.plot(accuracydf['KNN'], accuracydf['%Error'], color='r', linewidth=1.0, alpha=0.5)
plt.xlabel('K Nearest Neighbors')
plt.ylabel('Error Rate')
plt.title('KNN vs. Error Rate')
plt.plot(testdf['KNN'], testdf['%Error'], color='b', linewidth=1.0, alpha=0.5)
plt.show()
(c) What trend do you see in the results?
The Error rate plummets at the beggining from the small K's and generally gets smaller until around K = 100 were it begins to steadily increase. The error rate is at its lowest and fairly level from around K = 60 to k = 100, and is at its highest when K = 1, closely followed by k = 500.
The error rate for the test set behaves differently. There is zero error whe k = 1, because the model is tested and trained on the same data. More flexible models overfit to the data, and if this exact same data is the test data it will perform perfectly even though it is unlikely to be effective on a new set of data.
(d) What values of $k$ represent high bias and which represent high variance?
K = 1 represents high variance, and K = 500 represents high bias.
(e) What is the optimal value of $k$ and why?
accuracydf.loc[accuracydf['%Error'] == accuracydf['%Error'].min()]
| KNN | %Error | |
|---|---|---|
| 66 | 67.0 | 0.102 |
| 68 | 69.0 | 0.102 |
| 75 | 76.0 | 0.102 |
The optimal value of K is 13 for this data set. This has the lowest error rate in this dataset at 0.105.
(f) In kNN classifiers, the value of k controls the flexibility of the model - what controls the flexibility of other models?
In other models - more specifically linear regression, the flexibility of a model is determined by degrees of freedom. The degrees of freedom incease as the number of parameters in a model increase. The parameters in this context are the number of additional independent variables in the model.
[20 points] Linear regression and nonlinear transformations. You're given training and testing data contained in files "A2_Q7_train.csv" and "A2_Q7_test.csv" in the "data" folder for this assignment. Your goal is to develop a regression algorithm from the training data that performs well on the test data.
Hint: Use the scikit learn LinearRegression module.
ANSWER
(a) Create a scatter plot of your training data.
# Import Data, drop extraneous first column
test = pd.read_csv('Data/A2_Q7_test.csv')
train = pd.read_csv('Data/A2_Q7_train.csv')
test.drop(['Unnamed: 0'], axis=1)
train.drop(['Unnamed: 0'], axis=1)
# Turn individual columns into arrays
x = train['x'].values
y = train['y'].values
x_test = test['x'].values
y_test = test['y'].values
# Create the scatterplot, add title and legend
fig, ax = plt.subplots()
scatter = plt.scatter(x, y, alpha = .5)
plt.title('Training Data Distribution')
plt.xlabel('Coordinate X1')
plt.ylabel('Coordinate X2')
Text(0, 0.5, 'Coordinate X2')
(b) Estimate a linear regression model ($y = a_0 + a_1 x$) for the training data and calculate both the $R^2$ value and mean square error for the fit of that model for the training data. Also provide the equation representing the estimated model (e.g. $y = a_0 + a_1 x$, but with the estimated coefficients inserted. Consider this your baseline model against which you will compare other model options.
train = train.sort_values(by='x')
#Reshape data
x = x.reshape(-1,1)
# Fit and create model
reg = LinearRegression()
reg.fit(x,y)
pred = reg.predict(x)
# Store the Intercept and Coefficient
intercept = round(reg.intercept_,2)
beta = round(reg.coef_[0],2)
# store the R-squared and Mean square error terms
mse = metrics.mean_squared_error(y, pred)
r_squared = metrics.r2_score(y, pred)
print(f'The output of the model follows the equation: y = {intercept} + {beta}x')
print('r2: ', round(r_squared,2))
print('MSE: ', round(mse,2))
The output of the model follows the equation: y = 17.2 + 2.59x r2: 0.06 MSE: 791.42
(c) If features can be nonlinearly transformed, a linear model may incorporate those non-linear feature transformation relationships in the training process. From looking at the scatter plot of the training data, choose a transformation of the predictor variable, $x$ that may make sense for these data. This will be a multiple regression model of the form $y = a_0 + a_1 x_1 + a_2 x_2 + \ldots + a_n x_n$. Here $x_i$ could be any transformations of x - perhaps it's $\frac{1}{x}$, $log(x)$, $sin(x)$, $x^k$ (where $k$ is any power of your choosing). Provide the estimated equation for this multiple regression model (e.g. if you chose your predictors to be $x_1 = x$ and $x_2 = log(x)$, your model would be of the form $y = a_0 + a_1 x + a_2 log(x)$. Also provide the $R^2$ and mean square error of the fit for the training data.
# Creat X transformations
ctrain = train
x = ctrain['x']
sinx = np.sin(ctrain['x'].values)
cubex = ctrain['x'].values ** 3
ctest = test
x_test = ctest['x']
sinx_test = np.sin(ctest['x'].values)
cubex_test = ctest['x'].values ** 3
#Reformat so they can be used in linreg model
newdf = pd.DataFrame()
newdf['x'] = x
newdf['x2'] = sinx
newdf['x3'] = cubex
testdf = pd.DataFrame()
testdf['x'] = x_test
testdf['x2'] = sinx_test
testdf['x3'] = cubex_test
creg = LinearRegression()
creg.fit(newdf, y)
cpred = creg.predict(newdf)
# Store the Intercept and Coefficient
cintercept = round(creg.intercept_,2)
beta1, beta2, beta3 = creg.coef_
beta1 = round(beta1,2)
beta2 = round(beta2,2)
beta3 = round(beta3,2)
# store the R-squared and Mean square error terms
mse = metrics.mean_squared_error(y, cpred)
r_squared = metrics.r2_score(y, cpred)
print(f'The output of the model follows the equation: y = {intercept} + {beta1}x + {beta2}sin(x) + {beta3}x^3')
print('r2: ', round(r_squared,2))
print('MSE: ', round(mse,2))
The output of the model follows the equation: y = 17.2 + -1.38x + 5.35sin(x) + 0.08x^3 r2: 0.01 MSE: 838.56
(d) Using both of the models you created here in (b) and (c), plot the original data (as a scatter plot) and the two curves representing your models (each as a separate line).
graphx = x.values.reshape(-1,1)
#x = x.reshape(-1,1)
# Create the scatterplot, add title and legend
fig, ax = plt.subplots()
scatter2 = plt.scatter(x, y, alpha = .5)
plt.title('Regression Model Comparison Chart')
plt.xlabel('Coordinate X1')
plt.ylabel('Coordinate X2')
plt.plot(graphx, reg.predict(graphx), color = "green")
plt.plot(graphx, creg.predict(newdf), color = "red")
[<matplotlib.lines.Line2D at 0x11daff340>]
(e) Using the models above, apply them to the test data and estimate the $R^2$ and mean square error of the test dataset.
### Model 1
x = x.values.reshape(-1,1)
x_test = x_test.values.reshape(-1,1)
# Fit and create model
reg = LinearRegression()
reg.fit(x,y)
pred = reg.predict(x_test)
# Store the Intercept and Coefficient
intercept = round(reg.intercept_,2)
beta = round(reg.coef_[0],2)
# store the R-squared and Mean square error terms
mse = metrics.mean_squared_error(y_test, pred)
r_squared = metrics.r2_score(y_test, pred)
print(f'The output of the model follows the equation: y = {intercept} + {beta}x')
print('r2: ', round(r_squared,2))
print('MSE: ', round(mse,2))
The output of the model follows the equation: y = 18.35 + 0.09x r2: -0.03 MSE: 1014.65
### Model 2
creg = LinearRegression()
creg.fit(newdf, y)
cpred = creg.predict(testdf)
# Store the Intercept and Coefficient
intercept = round(creg.intercept_,2)
beta = round(creg.coef_[0],2)
# store the R-squared and Mean square error terms
mse = metrics.mean_squared_error(y_test, cpred)
r_squared = metrics.r2_score(y_test, cpred)
print(f'The output of the model follows the equation: y = {intercept} + {beta}x')
print('r2: ', round(r_squared,2))
print('MSE: ', round(mse,2))
The output of the model follows the equation: y = 18.27 + -1.38x r2: -0.03 MSE: 1012.53
(f) Which models perform better on the training data, and which on the test data? Why?
The transformed model worked significantly better on the training data, and slightly better on the test data. The model performed significantly better on the training data because it had more flexibility. In a regression model, the more degrees of freedom a model has the more flexible it becomes. Generally, the more flexible a model is the better it performs on training data.
For the test data, the two models were almost identicle with the transformed data performing slightly better in terms of MSE. The linear model performs poorly because the data does not appear to be linear but the model is. The transformed model is non linear and thus has a slight advantage in this regaurd.
(g) Imagine that the test data were significantly different from the training dataset. How might this affect the predictive capability of your model? Why?
If the test data is significantly different from the training data, the model is likely to perform very poorly on it. The model is created to predict data that follows a similar distribution as the training. If the test data does not look like this training data, than the model is far less effective. The training and test data must be similar for the model to be effective.