Conceptual

1.

Let $A = e^{\beta_0 + \beta_1X}$ and $P = p(X)$.
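As a sketch of the algebra, assuming the textbook's logistic form $p(X) = \frac{e^{\beta_0 + \beta_1X}}{1 + e^{\beta_0 + \beta_1X}}$: in the notation above,

$$P = \frac{A}{1 + A} \quad\Rightarrow\quad 1 - P = \frac{1}{1 + A} \quad\Rightarrow\quad \frac{P}{1 - P} = A = e^{\beta_0 + \beta_1X},$$

which is the odds form the question asks us to derive.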

2.

Observe that the denominator is a constant: it does not depend on the index $k$ in any way. Furthermore $\pi_l$, $\sigma$ and $\exp(z)$ (for any $z$) are always $> 0$, so it is a strictly positive constant, which we'll denote $c$ from here on.

Now notice that two terms in this expression do not depend on $k$: $\log(\sqrt{2\pi}\sigma)$ and $\frac{x^2}{2\sigma^2}$. These can be treated as constants and lumped into a new constant term we'll call $\log(c')$. Note that $\sqrt{2\pi}\sigma$ and $\exp(\frac{x^2}{2\sigma^2})$ are both positive, so $c'$ is still a strictly positive constant.

Since $c'$ is strictly positive and the log function is monotonically increasing, maximising $\delta_k(x)$ maximises $p_k(x)$, as required.
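For reference, a sketch of the algebra the argument above walks through, assuming the textbook's setup of normal class densities with means $\mu_k$, shared variance $\sigma^2$ and priors $\pi_k$:

$$p_k(x) = \frac{\pi_k \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(x - \mu_k)^2}{2\sigma^2}\right)}{\sum_l \pi_l \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(x - \mu_l)^2}{2\sigma^2}\right)} = \frac{\pi_k}{c\sqrt{2\pi}\sigma}\exp\left(-\frac{x^2}{2\sigma^2}\right)\exp\left(\frac{x\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2}\right),$$

so taking logs and absorbing the $k$-independent factors into $\log(c')$ gives

$$\log p_k(x) = \log(c') + \underbrace{\frac{x\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k)}_{\delta_k(x)}.$$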

3.

This question is wordy but easy to answer given the proof from the previous question. Recall that we only eliminated the quadratic term $\frac{x^2}{2\sigma^2}$ because it did not depend on $k$. In QDA the covariance matrix (which reduces to a variance for a single predictor) is class specific, so the term becomes $\frac{x^2}{2\sigma_k^2}$ and cannot be eliminated as a constant. The new discriminant is therefore quadratic in $x$.
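As a sketch, keeping the same setup but with a class-specific variance $\sigma_k^2$, the surviving terms are

$$\delta_k(x) = -\frac{x^2}{2\sigma_k^2} + \frac{x\mu_k}{\sigma_k^2} - \frac{\mu_k^2}{2\sigma_k^2} - \log(\sigma_k) + \log(\pi_k),$$

which is quadratic in $x$.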

4.

a)

10%, ignoring the edge cases at $X < 0.05$ and $X > 0.95$.

b)

1%

c)

$0.1^{100}$

d)

When $p$ is large and $n$ is relatively small, only an extremely small fraction of the data lies anywhere near a given observation. More accurately, since KNN always uses a fixed number of points to classify, in high dimensions those points will not lie "close" to the observation, which causes a poor fit.

e)

Shortcut: the side of the hypercube will always be $s = 0.1^{1/p}$.

Assuming a fixed range of $[0, 1]$ for each predictor and a uniform distribution, points are spread evenly over the hypervolume. We want a sub-cube whose hypervolume is 10% of the total hypervolume of the unit cube, which is always 1, so $V_{class} = s_{class}^p = 0.1$ and hence $s_{class} = 0.1^{1/p}$.
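As a quick numerical sketch of $s = 0.1^{1/p}$ for a few values of $p$:

# Side length of the hypercube that captures 10% of uniformly distributed observations
p = c(1, 2, 100)
0.1^(1/p)  # approximately 0.100, 0.316 and 0.977: in high dimensions the "local" cube nearly spans each axis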

5.

a)

Expect QDA to perform better on the training data, as its extra flexibility lets it overfit; expect LDA to perform better on the test data, as it matches the underlying (linear) truth more accurately.

b)

Expect QDA to perform better on both the training and the test data, since its extra flexibility can capture the non-linear boundary.

c)

QDA's test performance will improve relative to LDA as the number of observations grows, as is generally the case for more flexible models: with more data the variance penalty of the extra parameters shrinks.

d)

False. Given the extra quadratic terms, QDA will also fit the noise in the training data and will therefore tend to have a higher test error rate than LDA when the true boundary is linear.

6.

a)

b)

Take the log of both sides.
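As a sketch of that step, assuming the exercise's fitted coefficients ($\hat\beta_0 = -6$, $\hat\beta_1 = 0.05$ per hour studied, $\hat\beta_2 = 1$ per GPA point) and a GPA of $X_2 = 3.5$: a 50% chance means the odds equal 1, so taking the log of both sides gives

$$0 = \hat\beta_0 + \hat\beta_1 X_1 + \hat\beta_2 X_2 \quad\Rightarrow\quad X_1 = \frac{-\hat\beta_0 - \hat\beta_2 X_2}{\hat\beta_1} = \frac{6 - 3.5}{0.05} = 50 \text{ hours}.$$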

7.

Let’s use LDA.

Then we can apply Bayes' theorem with normal class densities and a shared variance.
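As a sketch, assuming the exercise's stated values ($\bar{X}_{yes} = 10$, $\bar{X}_{no} = 0$, $\hat\sigma^2 = 36$, $\pi_{yes} = 0.8$) and evaluating at $x = 4$:

$$p_{yes}(4) = \frac{\pi_{yes}\exp\left(-\frac{(4 - 10)^2}{2\hat\sigma^2}\right)}{\pi_{yes}\exp\left(-\frac{(4 - 10)^2}{2\hat\sigma^2}\right) + \pi_{no}\exp\left(-\frac{(4 - 0)^2}{2\hat\sigma^2}\right)} = \frac{0.8\,e^{-0.5}}{0.8\,e^{-0.5} + 0.2\,e^{-2/9}} \approx 0.75,$$

so the company has roughly a 75% chance of issuing a dividend. (The $\frac{1}{\sqrt{2\pi}\hat\sigma}$ factors cancel because the variance is shared across the two classes.)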

8.

The KNN (K=1) training error rate is always 0% by definition, which means its test error rate has to be 36% for the average to be 18%. Thus you should prefer logistic regression, since it has the lower test error rate of 30%.
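Writing the averaging out explicitly:

$$\frac{0\% + \text{test}_{KNN}}{2} = 18\% \quad\Rightarrow\quad \text{test}_{KNN} = 36\% > 30\%.$$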

9.

a)
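As a sketch, assuming the exercise's stated odds of 0.37 of defaulting: since $\text{odds} = \frac{p}{1 - p}$, we get

$$p = \frac{\text{odds}}{1 + \text{odds}} = \frac{0.37}{1.37} \approx 0.27,$$

i.e. roughly 27% of people with these odds will in fact default.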

b)
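Conversely, assuming the exercise's stated 16% chance of defaulting:

$$\text{odds} = \frac{p}{1 - p} = \frac{0.16}{0.84} \approx 0.19.$$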



Applied

10.

library(ISLR)
attach(Weekly)

a)

names(Weekly)
## [1] "Year"      "Lag1"      "Lag2"      "Lag3"      "Lag4"      "Lag5"     
## [7] "Volume"    "Today"     "Direction"
dim(Weekly)
## [1] 1089    9
corrplot::corrplot(cor(Weekly[,-9]), method="circle")

Just like the Smarket data, there appears to be no meaningful correlation between any of the predictors except Volume and Year, which is expected: trading volume grows over time.

plot(Volume)

b)

glm.fit = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data=Weekly, family=binomial)
summary(glm.fit)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial, data = Weekly)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6949  -1.2565   0.9913   1.0849   1.4579  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.26686    0.08593   3.106   0.0019 **
## Lag1        -0.04127    0.02641  -1.563   0.1181   
## Lag2         0.05844    0.02686   2.175   0.0296 * 
## Lag3        -0.01606    0.02666  -0.602   0.5469   
## Lag4        -0.02779    0.02646  -1.050   0.2937   
## Lag5        -0.01447    0.02638  -0.549   0.5833   
## Volume      -0.02274    0.03690  -0.616   0.5377   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1486.4  on 1082  degrees of freedom
## AIC: 1500.4
## 
## Number of Fisher Scoring iterations: 4

Of the predictors, only Lag2 is statistically significant (p ≈ 0.03).

c)

glm.prob = predict(glm.fit, type="response")
glm.pred = rep("Down", length(Direction))
glm.pred[glm.prob > 0.5] = "Up"
table(glm.pred, Direction)
##         Direction
## glm.pred Down  Up
##     Down   54  48
##     Up    430 557
mean(glm.pred == Direction)
## [1] 0.5610652

The overall prediction rate is 56.1%, and the majority of the errors are false positives: the model is bad at predicting when the market will go down, getting only 54/(54+430) = 11.2% of those weeks right.
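As a side note, these class-specific rates can be read straight off the confusion matrix; a quick sketch reusing the glm.pred and Direction objects from above:

conf = table(glm.pred, Direction)
conf["Down", "Down"] / sum(conf[, "Down"])  # fraction of true Down weeks predicted correctly, roughly 0.11
conf["Up", "Up"] / sum(conf[, "Up"])        # fraction of true Up weeks predicted correctly, roughly 0.92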

d)

train = Year <= 2008
test = !train
glm.fit = glm(Direction ~ Lag2, data=Weekly, family=binomial, subset=train)
summary(glm.fit)
## 
## Call:
## glm(formula = Direction ~ Lag2, family = binomial, data = Weekly, 
##     subset = train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.536  -1.264   1.021   1.091   1.368  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.20326    0.06428   3.162  0.00157 **
## Lag2         0.05810    0.02870   2.024  0.04298 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1354.7  on 984  degrees of freedom
## Residual deviance: 1350.5  on 983  degrees of freedom
## AIC: 1354.5
## 
## Number of Fisher Scoring iterations: 4
glm.prob = predict(glm.fit, Weekly[test, ], type="response")
glm.pred = rep("Down", length(Direction[test]))
glm.pred[glm.prob > 0.5] = "Up"
table(glm.pred, Direction[test])
##         
## glm.pred Down Up
##     Down    9  5
##     Up     34 56
mean(glm.pred == Direction[test])
## [1] 0.625

e)

library(MASS) # Needed to load in LDA
lda.fit = lda(Direction ~ Lag2, data=Weekly, subset=train)
lda.pred = predict(lda.fit, Weekly[test,])$class
table(lda.pred, Direction[test])
##         
## lda.pred Down Up
##     Down    9  5
##     Up     34 56
mean(lda.pred == Direction[test])
## [1] 0.625

f)

qda.fit = qda(Direction ~ Lag2, data=Weekly, subset=train)
qda.pred = predict(qda.fit, Weekly[test,])$class
table(qda.pred, Direction[test])
##         
## qda.pred Down Up
##     Down    0  0
##     Up     43 61
mean(qda.pred == Direction[test])
## [1] 0.5865385

g)

library(class)
set.seed(1) # To make this result reproducible
knn.pred = knn(data.frame(Lag2[train]), data.frame(Lag2[test]), Direction[train], k=1)
table(knn.pred, Direction[test])
##         
## knn.pred Down Up
##     Down   21 30
##     Up     22 31
mean(knn.pred == Direction[test])
## [1] 0.5

h)

Logistic regression and LDA give identical results, and are both superior to QDA and KNN (K=1).

i)

Let’s go through some logistic regression models first.

glm.fit = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, data=Weekly, family=binomial, subset=train)
summary(glm.fit)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial, data = Weekly, subset = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7186  -1.2498   0.9823   1.0841   1.4911  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.33258    0.09421   3.530 0.000415 ***
## Lag1        -0.06231    0.02935  -2.123 0.033762 *  
## Lag2         0.04468    0.02982   1.499 0.134002    
## Lag3        -0.01546    0.02948  -0.524 0.599933    
## Lag4        -0.03111    0.02924  -1.064 0.287241    
## Lag5        -0.03775    0.02924  -1.291 0.196774    
## Volume      -0.08972    0.05410  -1.658 0.097240 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1354.7  on 984  degrees of freedom
## Residual deviance: 1342.3  on 978  degrees of freedom
## AIC: 1356.3
## 
## Number of Fisher Scoring iterations: 4

On the training set it looks like only Lag1 is statistically significant, not Lag2, so let's try using that instead. This isn't really a good strategy: the fact that the apparent significance of predictors changes simply by splitting the full data into training and test sets suggests there was little real significance there in the first place.

glm.fit = glm(Direction ~ Lag1, data=Weekly, family=binomial, subset=train)
summary(glm.fit)
## 
## Call:
## glm(formula = Direction ~ Lag1, family = binomial, data = Weekly, 
##     subset = train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.519  -1.253   1.028   1.094   1.281  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.21829    0.06438   3.391 0.000697 ***
## Lag1        -0.05908    0.02892  -2.043 0.041059 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1354.7  on 984  degrees of freedom
## Residual deviance: 1350.4  on 983  degrees of freedom
## AIC: 1354.4
## 
## Number of Fisher Scoring iterations: 4
glm.prob = predict(glm.fit, Weekly[test, ], type="response")
glm.pred = rep("Down", length(Direction[test]))
glm.pred[glm.prob > 0.5] = "Up"
table(glm.pred, Direction[test])
##         
## glm.pred Down Up
##     Down    4  6
##     Up     39 55
mean(glm.pred == Direction[test])
## [1] 0.5673077

This actually performs worse than our previous logistic regression model. Let's try an interaction effect between Lag1 and Lag2 for fun.

glm.fit = glm(Direction ~ Lag1*Lag2, data=Weekly, family=binomial, subset=train)
summary(glm.fit)
## 
## Call:
## glm(formula = Direction ~ Lag1 * Lag2, family = binomial, data = Weekly, 
##     subset = train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.573  -1.259   1.003   1.086   1.596  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.211419   0.064589   3.273  0.00106 **
## Lag1        -0.051505   0.030727  -1.676  0.09370 . 
## Lag2         0.053471   0.029193   1.832  0.06700 . 
## Lag1:Lag2    0.001921   0.007460   0.257  0.79680   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1354.7  on 984  degrees of freedom
## Residual deviance: 1346.9  on 981  degrees of freedom
## AIC: 1354.9
## 
## Number of Fisher Scoring iterations: 4
glm.prob = predict(glm.fit, Weekly[test, ], type="response")
glm.pred = rep("Down", length(Direction[test]))
glm.pred[glm.prob > 0.5] = "Up"
table(glm.pred, Direction[test])
##         
## glm.pred Down Up
##     Down    7  8
##     Up     36 53
mean(glm.pred == Direction[test])
## [1] 0.5769231

The interaction term is not significant, and again the model performs worse than the original. Let's experiment with different values of K for KNN.

kvals = 1:300
error_rates = mapply(function(kval) {
        knn.pred = knn(data.frame(Lag2[train]), data.frame(Lag2[test]), Direction[train], k=kval)
        return(mean(knn.pred != Direction[test]))
    }, kvals)

invk = 1/kvals
plot(invk, error_rates, log="x", xlab="1/k", ylab="Error Rate")

1-error_rates[160]
## [1] 0.625

Looking at the general trend and discounting outliers, the optimal k is approximately 160. The prediction rate at that point is 62.5%, identical to that of our original logistic regression and LDA models.

knn.pred = knn(data.frame(Lag2[train]), data.frame(Lag2[test]), Direction[train], k=160)
table(knn.pred, Direction[test])
##         
## knn.pred Down Up
##     Down   10  5
##     Up     33 56

The confusion matrix is also very similar to those of the original logistic regression and LDA models: the model is still quite bad at predicting weeks in which the market goes down, but good at predicting when it goes up.

11.

detach(Weekly)
attach(Auto)

a)

mpg01 = rep(0, length(mpg))
mpg01[mpg > median(mpg)] = 1
Auto = data.frame(Auto, mpg01)

b)

We've investigated associations with mpg in previous labs, and we expect those associations to carry over to the newly created binary variable mpg01. We can look at the pairs plot again to quickly spot associations with mpg, and we can also create a series of boxplots to show more directly that these predictors are associated with mpg01. I haven't included every plot, for brevity.

pairs(Auto)

auto_normalized = scale(Auto[,-9])
boxplot(displacement ~ mpg01, ylab="Displacement", xlab="mpg01")

boxplot(weight ~ mpg01, ylab="Weight", xlab="mpg01")

boxplot(horsepower ~ mpg01, ylab="Horsepower", xlab="mpg01")

c)

There are 392 entries; let's use 75% for training and the remainder for testing.

sample_size = floor(0.75 * nrow(Auto))
set.seed(1) # so the training data is reproducible
sample_part = sample(seq_len(nrow(Auto)), sample_size)

auto_training = Auto[sample_part, ]
auto_test = Auto[-sample_part, ]
mpg01_train = mpg01[sample_part]
mpg01_test = mpg01[-sample_part]

d)

lda.fit = lda(mpg01 ~ cylinders + displacement + horsepower + weight + year, data=auto_training)
lda.pred = predict(lda.fit, auto_test)$class
table(lda.pred, mpg01_test)
##         mpg01_test
## lda.pred  0  1
##        0 42  0
##        1  4 52
mean(lda.pred != mpg01_test)
## [1] 0.04081633

We see an error rate of only 4.1%, which is excellent. All the errors are false positives (true mpg01 = 0 predicted as 1), a class-specific error rate of 4/(4+42) = 8.7%.

e)

qda.fit = qda(mpg01 ~ cylinders + displacement + horsepower + weight + year, data=auto_training)
qda.pred = predict(qda.fit, auto_test)$class
table(qda.pred, mpg01_test)
##         mpg01_test
## qda.pred  0  1
##        0 42  3
##        1  4 49
mean(qda.pred != mpg01_test)
## [1] 0.07142857

QDA performs slightly worse, with a 7.1% error rate.

f)

glm.fit = glm(mpg01 ~ cylinders + displacement + horsepower + weight + year, data=auto_training, family=binomial)
glm.prob = predict(glm.fit, auto_test, type="response")
glm.pred = rep(0, length(mpg01_test))
glm.pred[glm.prob > 0.5] = 1
table(glm.pred, mpg01_test)
##         mpg01_test
## glm.pred  0  1
##        0 40  2
##        1  6 50
mean(glm.pred != mpg01_test)
## [1] 0.08163265

Our logistic regression model has a test error rate of 8.2%.

g)

Let’s reuse the script we made in question 10.

train.X = cbind(auto_training$cylinders, auto_training$displacement, auto_training$horsepower, auto_training$weight, auto_training$year)
test.X = cbind(auto_test$cylinders, auto_test$displacement, auto_test$horsepower, auto_test$weight, auto_test$year)

nk = 50
kvals = 1:nk

error_rates = mapply(function(kval) {
        knn.pred = knn(train.X, test.X, mpg01_train, k=kval)
        return(mean(knn.pred != mpg01_test))
    }, kvals)

plot(error_rates, xlab="k", ylab="Error Rate")

error_rates[33]
## [1] 0.07142857

The best performance seems to be around k=33, which corresponds to an error rate of 7.1%, but none of the KNN fits match our original LDA model.

12.

a)

Power = function() {
    print(2^3)
}

Power()
## [1] 8

b)

Power2 = function(x, a) {
    print(x^a)
}

Power2(3, 8)
## [1] 6561

c)

Power2(10, 3)
## [1] 1000
Power2(8, 17)
## [1] 2.2518e+15
Power2(131, 3)
## [1] 2248091

d)

Power3 = function(x, a) {
    result = x^a
    return(result)
}

Power3(3, 8)
## [1] 6561

e)

plot(1:10, Power3(1:10, 2), log="xy")

f)

PlotPower = function(x, a) {
    plot(x, Power3(x, a))
}

PlotPower(1:10, 3)

13.

library(MASS)
attach(Boston)
crim01 = rep(0, length(crim))
crim01[crim > median(crim)] = 1

Boston = data.frame(Boston, crim01)
pairs(Boston)

Let’s split the data into training and test sets first.

sample_size = floor(0.75 * nrow(Boston))
set.seed(1) # so the training data is reproducible
training = sample(seq_len(nrow(Boston)), sample_size)
test = setdiff(c(1:nrow(Boston)), training)

I've used a few different ways of splitting training and test sets in this exercise just to see what I like best; this is probably my favourite, but do whatever works best for you.
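For comparison, a sketch of how the same split could be written with a logical mask, in the style used for the Weekly data in question 10 (the boston_train/boston_test names are just illustrative; the fits below keep using subset=training):

mask = seq_len(nrow(Boston)) %in% training  # TRUE for rows sampled into the training set
boston_train = Boston[mask, ]
boston_test = Boston[!mask, ]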

glm.fit = glm(crim01 ~ . -crim, data=Boston, family=binomial, subset=training)
summary(glm.fit)
## 
## Call:
## glm(formula = crim01 ~ . - crim, family = binomial, data = Boston, 
##     subset = training)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7109  -0.1721   0.0001   0.0054   3.5188  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -33.939299   7.488211  -4.532 5.83e-06 ***
## zn           -0.078757   0.038047  -2.070  0.03845 *  
## indus        -0.070021   0.052616  -1.331  0.18326    
## chas          0.401738   0.853233   0.471  0.63775    
## nox          47.398517   8.643372   5.484 4.16e-08 ***
## rm           -0.331606   0.809108  -0.410  0.68192    
## age           0.042799   0.015352   2.788  0.00531 ** 
## dis           0.746777   0.262032   2.850  0.00437 ** 
## rad           0.586171   0.179218   3.271  0.00107 ** 
## tax          -0.006918   0.003313  -2.088  0.03678 *  
## ptratio       0.250405   0.141495   1.770  0.07678 .  
## black        -0.006728   0.005234  -1.285  0.19864    
## lstat        -0.011002   0.059629  -0.185  0.85362    
## medv          0.120250   0.076044   1.581  0.11380    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 525.38  on 378  degrees of freedom
## Residual deviance: 159.39  on 365  degrees of freedom
## AIC: 187.39
## 
## Number of Fisher Scoring iterations: 9

Let's keep zn, nox, age, dis, rad and tax (the predictors significant at the 5% level) and remove the other predictors from the model.

glm.fit = glm(crim01 ~ zn + nox + age + dis + rad + tax, data=Boston, family=binomial, subset=training)
summary(glm.fit)
## 
## Call:
## glm(formula = crim01 ~ zn + nox + age + dis + rad + tax, family = binomial, 
##     data = Boston, subset = training)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8364  -0.2071   0.0002   0.0049   3.2311  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -24.288383   4.209607  -5.770 7.94e-09 ***
## zn           -0.074738   0.031375  -2.382  0.01721 *  
## nox          36.799242   6.671607   5.516 3.47e-08 ***
## age           0.032198   0.011138   2.891  0.00384 ** 
## dis           0.488325   0.208636   2.341  0.01925 *  
## rad           0.674803   0.151494   4.454 8.42e-06 ***
## tax          -0.009189   0.002797  -3.285  0.00102 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 525.38  on 378  degrees of freedom
## Residual deviance: 170.59  on 372  degrees of freedom
## AIC: 184.59
## 
## Number of Fisher Scoring iterations: 9

And finally evaluate our test error rate.

glm.prob = predict(glm.fit, Boston[test,], type="response")
glm.pred = rep(0, length(crim01[test]))
glm.pred[glm.prob > 0.5] = 1

table(glm.pred, crim01[test])
##         
## glm.pred  0  1
##        0 57  9
##        1  8 53
mean(glm.pred != crim01[test])
## [1] 0.1338583

The test error rate for this logistic regression model is 13.4%.

Let’s evaluate LDA now.

lda.fit = lda(crim01 ~ zn + nox + age + dis + rad + tax, data=Boston, subset=training)
lda.pred = predict(lda.fit, Boston[test,])$class

table(lda.pred, crim01[test])
##         
## lda.pred  0  1
##        0 62 17
##        1  3 45
mean(lda.pred != crim01[test])
## [1] 0.1574803

The test error rate for the LDA model is 15.7%.

Onto QDA.

qda.fit = qda(crim01 ~ zn + nox + age + dis + rad + tax, data=Boston, subset=training)
qda.pred = predict(qda.fit, Boston[test,])$class

table(qda.pred, crim01[test])
##         
## qda.pred  0  1
##        0 62 11
##        1  3 51
mean(qda.pred != crim01[test])
## [1] 0.1102362

QDA has the best performance so far, with a test error rate of 11.0%.

Finally, let's run KNN, reusing the script we've been using throughout this chapter.

train.X = cbind(zn, nox, age, dis, rad, tax)[training,]
test.X = cbind(zn, nox, age, dis, rad, tax)[test, ]
train.Y = crim01[training]
test.Y = crim01[test]

nk = 100
kvals = 1:nk

error_rates = mapply(function(kval) {
        set.seed(1)
        knn.pred = knn(train.X, test.X, train.Y, k=kval)
        return(mean(knn.pred != test.Y))
    }, kvals)

plot(error_rates, xlab="k", ylab="Error Rate")

1-NN has the best performance, with a test error rate of 8.6%. To be honest I'm not entirely sure what to make of this; it could be implying that the true decision boundary is highly non-linear, which may also help explain why QDA performs slightly better than logistic regression or LDA.