Exercises III
Some suggested solutions for the problems posed in the exercises for part III.
- I first build the full model:
> mfull <- lm (price ~ model + mileage + color + transmission)
> summary(mfull)
In the summary I notice that the regression coefficients for model and mileage are clearly significant, while the coefficients for color and transmission are in strong doubt. I thus reduce model complexity and explore the following alternative models, systematically leaving out predictors:
> mmmc <- lm (price ~ model + mileage + color)
> mmmt <- lm (price ~ model + mileage + transmission)
> mmm <- lm (price ~ model + mileage)
> mm0 <- lm (price ~ model)
> mm1 <- lm (price ~ mileage)
The AIC comparison gives me the following scores:
> AIC (mfull, mmmc, mmmt, mmm, mm0, mm1)
df AIC
mfull 14 2670.056
mmmc 13 2672.206
mmmt 6 2665.508
mmm 5 2665.918
mm0 4 2814.940
mm1 3 2687.173
I thus see that my best candidate by AIC is the model "mmmt", with very little difference to "mmm". Because of this very small difference in AIC scores, and since the regression coefficient for transmission is not significant in "mmmt", I would prefer the more parsimonious model "mmm", i.e. conclude that only the model type and the mileage driven have significant effects on price.
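As an optional cross-check (my own addition, not required by the exercise), the two nested models can also be compared directly with an F-test; a non-significant p-value for the extra transmission term would support dropping it:
> anova (mmm, mmmt)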
The analysis up to this point, however, has to be taken with some caution: we have not analyzed any interaction effects, which goes a bit beyond this tutorial.
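Just to sketch what such an analysis could look like (this is purely illustrative and not part of the exercise), interactions are added in R's formula interface with * or :, for example an interaction between model and mileage:
> mint <- lm (price ~ model * mileage)
> summary(mint)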
- The idea here is to generate two independently drawn vectors of 100 normally distributed random variates from N(0,1); let us call them X and Y. If I set Z=0.5*X, then linear regression of Z ~ X will of course return a slope of 0.5, but in this case 100% of the variance of Z is explained by X. So how do we construct Z such that less variance is explained? The idea is to set Z=0.5*X+alpha*Y, i.e. I add randomness from Y with "strength" alpha; since X and Y are independent, there will then be a portion of the variation in Z which X cannot explain. The question is what value of alpha I need. I could actually calculate this, but here we approach it numerically: I build an R function that constructs Z=0.5*X+alpha*Y for different alpha, carries out the linear regression, and stores the r-squared values:
reg <- function () {
  set.seed(0)                        # fix the random number generator for reproducibility
  rs <- double(100)                  # storage for the r-squared values
  x <- rnorm (100, 0, 1)
  y <- rnorm (100, 0, 1)
  for (i in 1:100) {                 # alpha = i/100 runs from 0.01 to 1
    z <- 0.5 * x + y * i / 100.
    m <- lm (z ~ x)
    rs[i] <- summary(m)$r.squared    # extract r-squared from the regression summary
  }
  print (rs)
}
Perhaps the only interesting point here is how you extract values from the regression results via "summary(m)$r.squared". The function then reports all of the r-squared values as a function of the loop index (which in turn determines alpha, i.e. alpha = index/100). Note that I also set the seed of the random number generator to 0; results could otherwise differ quite a bit since our sample size is small. Reading the results, in my case I find that alpha around 0.53 gives an r-squared value of roughly 0.5.
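As an aside (my own addition, not part of the model answer): the value of alpha can also be derived analytically. With X and Y independent and both of unit variance, the fraction of variance explained is Var(0.5*X)/Var(Z) = 0.25/(0.25 + alpha^2); setting this equal to 0.5 gives alpha = 0.5. The numerical experiment lands slightly higher (about 0.53) simply because of sampling noise in this particular draw of 100 points.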
Let's test if we were indeed successful with our experiment:
> set.seed(0)
> x <- rnorm (100, 0,1)
> y <- rnorm (100, 0, 1)
> z <- 0.5 * x + y*0.53
> m <- lm (z ~ x)
> summary(m)
The summary should now give you an r-squared close to 0.5. Note, however, that setting the seed appropriately is essential here (otherwise you get different vectors of random variates X and Y, and hence a different r-squared).
- To approach this problem, I first coded a function that calculates the standard deviation of a sample of n normal variates. The function returns the standard deviation averaged over 10000 such samples. My code reads:
sdn <- function (n) {
  s <- 0.
  for (i in 1:10000) {     # average over 10000 independent samples
    x <- rnorm (n, 0, 1)   # draw a sample of size n from N(0,1)
    s <- s + sd(x)         # note: for n = 1, sd() returns NA, so the first point of the plot below is missing
  }
  s <- s / 10000.
  return(s)
}
I then initialize a vector of doubles, say of size 20, because I want to plot the dependence of sd's on sample size for sample sizes 1-20:
> x <- double(20)
Next, I set the values of x:
> for (i in 1:20) { x[i] <- sdn(i) }
This may take a while (if your computer is very slow, you might want to reduce the 10000 runs in the function sdn to a lower number). I can now plot the dependence of the standard deviation on the sample size
> plot (x)
and obtain a figure similar to the following:
You will notice that in the function sdn I have drawn n random variates from a normal distribution with mean zero and standard deviation 1, hence I would expect the estimator to also (on average) return a standard deviation of one. As you see in the figure, it generally does not: as discussed in the last lecture, the estimator underestimates the standard deviation (even with Bessel's correction, which R's sd() applies to the variance, the standard deviation itself remains biased low). You also see in the figure that the amount of underestimation vanishes as the sample size becomes larger. For large sample sizes my estimate converges to 1, the true standard deviation of the underlying distribution I have sampled from.
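As a quick optional sanity check (my own addition, not part of the exercise), evaluating the function for a much larger sample size should already return a value very close to 1:
> sdn(1000)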
- I constructed a solution using the rep() function to generate vectors composed of repeated 0's with a 1 at a certain position. The function below loops over an index, calculates the position p of the "1" from a sine pattern (suitably scaled so that the printed vector of 30 elements fits on one terminal line without wrapping), constructs a vector of 0's with a 1 at position p, and prints this vector. You can start the function by calling "sinwave()" from the terminal.
sinwave <- function() {
  for (i in 1:1000) {
    p <- as.integer(15 + 14. * sin(pi * i / 18.))   # position of the "1", oscillating between 1 and 29
    x <- c(rep(0, p - 1), 1, rep(0, 30 - p))        # vector of length 30: zeros with a 1 at position p
    print(x)
  }
}
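A small cosmetic variant (my own suggestion, not part of the model answer): print() on a numeric vector also prints the index prefix "[1]" on every line, which clutters the wave. Using cat() instead gives a cleaner pattern of a single marker moving across the screen:
sinwave2 <- function() {
  for (i in 1:1000) {
    p <- as.integer(15 + 14. * sin(pi * i / 18.))
    cat(strrep(" ", p - 1), "*", "\n", sep = "")    # p-1 spaces, then a single marker
  }
}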