# Linear regression

Linear regression is the approximation of a set of data points by a single linear function of one or more variables.

## Mean squared error

Suppose we are given a multiset of dependent variable values $Y = \{y_1, y_2, \dots , y_n\}$ and a multiset of corresponding independent variable values $X = \{x_1, x_2, \dots , x_n\}$. We want to create a function $f(x)$ that predicts $y$ with as little overall error as possible. We can quantify the error using mean squared error, defined by $$\mathrm{MSE} = \frac{\sum_{i=1}^{n} (y_i - f(x_i))^2}{n}.$$ Sometimes $f(x_i)$ is denoted $\hat{y}_i$, because $f(x_i)$ is a prediction of $y_i$.
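The definition above translates directly into code. Here is a minimal sketch (the helper name `mse` and the sample data are illustrative, not from the text):

```python
def mse(ys, preds):
    """Mean squared error: average of squared differences y_i - f(x_i)."""
    return sum((y, p) == () or (y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

def mse(ys, preds):
    """Mean squared error: average of squared differences y_i - f(x_i)."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

ys = [2.0, 4.0, 6.0]       # observed y_i
preds = [2.5, 3.5, 6.5]    # predictions f(x_i)
print(mse(ys, preds))      # (0.25 + 0.25 + 0.25) / 3 = 0.25
```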

Note the similarity of $\mathrm{MSE}$ to the distance formula $d$ in $n$-dimensional Euclidean space; in fact, $\mathrm{MSE} = \frac{d^2}{n}$, so $\mathrm{MSE}$ increases monotonically with $d$ (which is always nonnegative). Thus, minimizing $\mathrm{MSE}$ corresponds to minimizing the Euclidean distance between the point whose coordinates are the $y_i$ and the point whose coordinates are the $f(x_i)$.
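The identity $\mathrm{MSE} = d^2/n$ can be checked numerically; this sketch treats the $y_i$ and $f(x_i)$ as coordinates of two points in $\mathbb{R}^4$ (the data values are made up for illustration):

```python
import math

ys    = [1.0, 3.0, 5.0, 7.0]   # observed y_i
preds = [1.5, 2.5, 5.5, 6.5]   # predictions f(x_i)

# Euclidean distance between the two points in R^4
d = math.dist(ys, preds)
mse = sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)
print(math.isclose(mse, d ** 2 / len(ys)))  # True
```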

If $f(x)$ is a constant function equal to the arithmetic mean of $Y$, then the $\mathrm{MSE}$ equals the (population) variance of $Y$.
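This fact can be verified with the standard library's `statistics` module, which provides the population variance as `pvariance` (the sample data below are arbitrary):

```python
from statistics import mean, pvariance

ys = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mu = mean(ys)  # the constant predictor f(x) = mean(Y)

# MSE of the constant-mean predictor
mse_of_mean = sum((y - mu) ** 2 for y in ys) / len(ys)
print(mse_of_mean, pvariance(ys))  # both equal the population variance, 4.0
```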

### Vector-valued functions

Sometimes multiple values are to be predicted in conjunction (for example, in a weather forecast, the wind components in both the north and east directions). In this case the $y_i$ are represented by vectors, so the predictor function $f(x)$ should also be a vector-valued function. The $\mathrm{MSE}$ formula is altered slightly to include magnitudes: $$\mathrm{MSE} = \frac{\sum_{i=1}^{n} \lVert \mathbf{y}_i - \mathbf{f}(x_i) \rVert ^2}{n}.$$ The summands in the numerator, by the vector magnitude formula, are themselves the sum of squares of differences between components.
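A sketch of the vector-valued version, with each $\mathbf{y}_i$ stored as a tuple of components (the helper name `vector_mse` and the two-component wind example are illustrative):

```python
def vector_mse(ys, preds):
    """MSE for vector targets: mean of squared Euclidean norms of residuals."""
    return sum(
        # squared magnitude ||y_i - f(x_i)||^2, summed over components
        sum((yc - pc) ** 2 for yc, pc in zip(y, p))
        for y, p in zip(ys, preds)
    ) / len(ys)

# e.g. (north, east) wind components per observation
ys    = [(1.0, 2.0), (3.0, 4.0)]
preds = [(1.0, 1.0), (2.0, 4.0)]
print(vector_mse(ys, preds))  # ((0 + 1) + (1 + 0)) / 2 = 1.0
```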

### Multiple regression

If there are multiple independent variables in conjunction (for example, if a student's past three AMC scores are used to predict the next score), then the regression is called a multiple regression. Each $x_i$ must then be viewed as a sequence $x_{i1}, x_{i2}, \dots, x_{im}$, where $m$ is the number of predictors. The terms of this sequence are passed one by one into $f$ as arguments; $f$ is therefore a function of $m$ variables.
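As a sketch of a multiple regression with $m = 2$ predictors, assuming NumPy is available: the targets below are generated exactly by $y = 1 + 2x_{i1} + 3x_{i2}$, so a least-squares fit recovers those coefficients. The data and coefficients are made up for illustration.

```python
import numpy as np

# Each row holds the predictors (x_{i1}, x_{i2}) for one observation.
X = np.array([[1.0, 2.0],
              [2.0, 0.0],
              [3.0, 1.0],
              [4.0, 3.0]])
# Targets generated exactly (no noise) by y = 1 + 2*x1 + 3*x2.
y = 1 + 2 * X[:, 0] + 3 * X[:, 1]

A = np.column_stack([np.ones(len(X)), X])     # prepend an intercept column
coef, *_ = np.linalg.lstsq(A, y, rcond=None)  # least-squares coefficients
print(coef)  # approximately [1., 2., 3.]
```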