## Least Squares

### Least Squares¶

We now turn to the conditional expectation $E(Y \mid X)$ viewed as an estimate or predictor of $Y$ given the value of $X$. As you saw in Data 8, the *mean squared error* of prediction can be used to compare predictors: those with small mean squared errors are better.

In this section we will identify *least squares predictors*, that is, predictors that minimize mean squared error among all predictors in a specified class.

### Minimizing the MSE¶

Suppose you are trying to estimate or predict the value of $Y$ based on the value of $X$. The predictor $E(Y \mid X) = b(X)$ seems to be a good one to use, based on the scatter plots we examined in the previous section.

It turns out that $b(X)$ is the *best* predictor of $Y$ based on $X$, according the least squares criterion.

Let $h(X)$ be any function of $X$, and consider using $h(X)$ to predict $Y$. Define the *mean squared error of the predictor $h(X)$* to be

We will now show that $b(X)$ is the best predictor of $Y$ based on $X$, in the sense that it minimizes this mean squared error over all functions $h(X)$.

To do so, we will use a fact we proved in the previous section:

- If $g(X)$ is any function of $X$ then $E\big{(}(Y - b(X))g(X)\big{)} = 0$.

### Least Squares Predictor¶

The calculations in this section include much of the theory behind *least squares prediction* familiar to you from Data 8. The result above shows that the least squares predictor of $Y$ based on $X$ is the conditional expectation $b(X) = E(Y \mid X)$.

In terms of the scatter diagram of observed values of $X$ and $Y$, the result is saying that the best predictor of $Y$ given $X$, by the criterion of smallest mean squared error, is the average of the vertical strip at the given value of $X$.

Given $X$, the root mean squared error of this estimate is the *SD of the strip*, that is, the conditional SD of $Y$ given $X$:

This is a random variable; its value is determined by the variation within the strip at the given value of $X$.

Overall across the entire scatter diagram, the root mean squared error of the estimate $E(Y \mid X)$ is

$$ RMSE(b) ~ = ~ \sqrt{E\Big{(}\big{(}Y - b(X)\big{)}^2\Big{)}} ~ = ~ \sqrt{E\big{(} Var(Y \mid X) \big{)}} $$Notice that the result makes no assumption about the joint distribution of $X$ and $Y$. The scatter diagram of the generated $(X, Y)$ points can have any arbitrary shape. So the result can be impractical, as there isn't always a recognizable functional form for $E(Y \mid X)$.

Sometimes we want to restrict our attention to a class of predictor functions of a specified type, and find the best one among those. The most important example of such a class is the set of all linear functions $aX + b$.

### Least Squares Linear Predictor¶

Let $h(X) = aX + b$ for constants $a$ and $b$, and let $MSE(a, b)$ denote $MSE(h)$.

$$ MSE(a, b) ~ = ~ E\big{(} (Y - (aX + b))^2 \big{)} ~ = ~ E(Y^2) + a^2E(X^2) + b^2 -2aE(XY) - 2bE(Y) + 2abE(X) $$To find the *least squares linear predictor*, we have to minimize this MSE over all $a$ and $b$. We will do this using calculus, in two steps:

- Fix $a$ and find the value $b_a^*$ that minimizes $MSE(a, b)$ for that fixed value of $a$.
- Then plug in the minimizing value $b_a^*$ in place of $b$ and minimize $MSE(a, b_a^*)$ with respect to $a$.

#### Step 1.¶

Fix $a$ and minimize $MSE(a, b)$ with respect to $b$.

$$ \frac{d}{db} MSE(a, b) ~ = ~ 2b - 2E(Y) + 2aE(X) $$Set this equal to 0 and solve to see that the minimizing value of $b$ for the fixed value of $a$ is

$$ b_a^* ~ = ~ E(Y) - aE(X) $$#### Step 2.¶

Now we have to minimize the following function with respect to $a$:

\begin{align*} E\big{(} (Y - (aX + b_a^*))^2 \big{)} ~ &= ~ E\big{(} (Y - (aX + E(Y) - aE(X)))^2 \big{)} \\ &= ~ E\Big{(} \big{(} (Y - E(Y)) - a(X - E(X))\big{)}^2 \Big{)} \\ &= ~ E\big{(} (Y - E(Y))^2 \big{)} - 2aE\big{(} (Y - E(Y))(X - E(X)) \big{)} + a^2E\big{(} (X - E(X))^2 \big{)} \\ &= ~ Var(Y) - 2aCov(X, Y) + a^2Var(X) \end{align*}The derivative with respect to $a$ is $2Cov(X, Y) + 2aVar(X)$. Thus the minimizing value of $a$ is

$$ a^* ~ = ~ \frac{Cov(X, Y)}{Var(X)} $$### Slope and Intercept of the Regression Line¶

The least squares straight line is called the *regression line*.You now have a proof of its equation, familiar to you from Data 8. The slope and intercept are given by

To derive the second expression for the slope, recall that in exercises you defined the *correlation* between $X$ and $Y$ to be

#### Regression in Standard Units¶

If both $X$ and $Y$ are measured in standard units, then the slope of the regression line is the correlation $r(X, Y)$ and the intercept is 0.

In other words, given that $X = x$ standard units, the predicted value of $Y$ is $r(X, Y)x$ standard units. When $r(X, Y)$ is positive but not 1, this result is called the *regression effect*: the predicted value of $Y$ is closer to 0 than the given value of $X$.

It is important to note that the equation of the regression line holds regardless of the shape of the joint distribution of $X$ and $Y$. Also note that there is always a best straight line predictor among all straight lines, regardless of the relation between $X$ and $Y$. If the relation isn't roughly linear you won't want to use the best straight line for predictions, because the best straight line is best among a bad class of predictors. But it exists.