Welcome to the third installment of our series of posts on linear regression… our way!

Let’s start by recapping what we already discussed:

In the first post, we explained how to define linear regression as a supervised learner: let $\mathfrak{X}$ be a set of features and $\mathfrak{y}$ a finite dimensional inner product space. Pick $\mathfrak{H} \subset \text{Hom}_{\mathbb{R}}(\mathfrak{X},\mathfrak{y})$ to be any finite dimensional subspace of functions, as well as a collection $\mathfrak{D}$ of finite datasets $\Delta \subset \mathfrak{X}\times \mathfrak{y}$ that *separate* $\mathfrak{H}$, in the sense that any $h \in \mathfrak{H}$ is uniquely determined by its restriction to $\Delta$.

Finally, define a cost function as follows
$$
c: \mathfrak{D}\times \mathfrak{H}\longrightarrow \mathbb{R}:(\Delta, f)\mapsto \vert \vert (f(x)-y)_{(x,y)\in\Delta}\vert \vert_{\mathfrak{y}^\Delta}=\sqrt{\sum_{(x,y)\in \Delta}\vert \vert f(x)-y\vert\vert_{\mathfrak{y}}^2}
$$

Then this defines a sharp learner, in the sense that $$ h_\Delta = \textrm{argmin}_{f \in \mathfrak{H}} c(\Delta, f) $$ is well defined.
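In coordinates, picking a basis of $\mathfrak{H}$ turns this minimization into an ordinary least-squares problem. Here is a minimal NumPy sketch; the dataset and the basis $\{1, x\}$ are hypothetical choices for illustration, not anything from the earlier posts:

```python
import numpy as np

# Hypothetical dataset Δ ⊂ ℝ × ℝ and hypothesis space H = span{1, x}
xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = np.array([1.1, 1.9, 3.2, 3.9])

# Evaluate the basis functions of H on the features: one column per basis function
A = np.column_stack([np.ones_like(xs), xs])

# argmin_f c(Δ, f): least squares returns the coefficient vector of h_Δ
coeffs, residuals, rank, _ = np.linalg.lstsq(A, ys, rcond=None)

# The separation assumption shows up as rank(A) = dim(H),
# which is exactly what makes the minimizer unique
assert rank == A.shape[1]
```

The rank check at the end is where the separation condition on $\Delta$ earns its keep: without it, several hypotheses would attain the same minimal cost.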

In the second post, we explained how to go about proving this result.

Specifically, we showed how to *construct* this hypothesis $h_\Delta$:

Look at the following map: $$ \text{ev}_{\Delta}: \mathfrak{H}\longrightarrow \mathfrak{y}^\Delta: f\mapsto (f(x))_{(x,y) \in \Delta} $$ and consider the image subspace $\text{im}(\text{ev}_\Delta)\subset \mathfrak{y}^\Delta$. Then $\text{ev}_\Delta(h_\Delta)$ is the projection of the labels $(y)_{(x,y) \in \Delta}$ onto $\text{im}(\text{ev}_\Delta)$. In the previous post, we denoted the resulting hypothesis by $h_\Delta = \text{ev}^+\big( (y)_{(x,y) \in \Delta}\big)$.
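Concretely, in a basis of $\mathfrak{H}$, the map $\text{ev}_\Delta$ becomes a matrix $A$, and the projection of the labels onto $\text{im}(\text{ev}_\Delta)$ is computed by $A A^+$. A sketch with a hypothetical $A$ and label vector $y$, using NumPy's `np.linalg.pinv` as the $+$ operation:

```python
import numpy as np

# Hypothetical matrix of ev_Δ in a basis of H, and the label vector
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.1, 1.9, 3.2, 3.9])

# Projection of y onto im(ev_Δ) = column space of A, via the pseudo-inverse A⁺
proj = A @ np.linalg.pinv(A) @ y

# The same point is ev_Δ(h_Δ), with h_Δ = ev⁺(y)
h_coeffs = np.linalg.pinv(A) @ y
assert np.allclose(proj, A @ h_coeffs)

# The residual y - proj is orthogonal to im(ev_Δ), as a projection should be
assert np.allclose(A.T @ (y - proj), 0)
```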

Today, we explain how this $+$ sign is much more than just a piece of notation. There is in fact some very elegant mathematics lurking in the background in the form of Moore-Penrose pseudo-inverses!

To make the exposition clearer, we will define them in the more general context of linear maps $f:V\longrightarrow W$ between finite dimensional vector spaces. The motivating question here is the simple observation that $f$ need not be invertible, but instead we could ask:

*Can we still associate an “inverse” to $f$ in a natural way?*

The starting point of the answer is the fact that $f$ is invertible if and only if two things happen: the kernel has to be zero and the image has to coincide with $W$. Now, whether or not this is the case, we can always restrict $f$ cleverly so as to avoid those two issues: choose a complement $U_V$ of the kernel, so that $\ker(f)\oplus U_V=V$, and a complement $U_W$ of the image, so that $\textrm{im}(f)\oplus U_W=W$. Restricted to $U_V$, the map $f$ is injective, and viewed as a map onto $\textrm{im}(f)$, it is surjective.

In other words, at the very least, the map $f$ indeed does have a (restricted) inverse $f^\sharp: \textrm{im}(f)\longrightarrow U_V$.

Actually, it turns out that we can always lift $f^\sharp$ to a map defined on the whole of $W$, and in a unique way!

More precisely, if we let $\pi_{\textrm{im}(f)}: W \longrightarrow \textrm{im}(f)$ be the canonical projection (remember that we decomposed $W=\text{im}(f)\oplus U_W$ above), we have:
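In matrix terms, the construction $f^\sharp \circ \pi_{\textrm{im}(f)}$ can be carried out explicitly. A minimal NumPy sketch; the map $f(x_1,x_2)=(x_1,0)$ and the deliberately non-orthogonal complements below are hypothetical examples:

```python
import numpy as np

def pseudo_inverse_from_complements(F, U_V, U_W):
    """Build the lift of f♯ along π for the map given by matrix F.

    U_V: columns form a basis of a complement of ker(F) in V.
    U_W: columns form a basis of a complement of im(F) in W.
    (A sketch: no validation that the complements are genuine.)
    """
    # f restricted to U_V is an isomorphism onto im(f); the columns of
    # F @ U_V form a basis of im(F).
    B = F @ U_V
    # π: W → im(F) along U_W, in coordinates: write w = B a + U_W b, keep a.
    M = np.column_stack([B, U_W])    # change of basis on W
    k = B.shape[1]
    coords = np.linalg.inv(M)        # w ↦ (a, b)
    # f♯ ∘ π: invert f on its image, landing in U_V
    return U_V @ coords[:k, :]

# Hypothetical example: f(x1, x2) = (x1, 0) with deliberately skew complements
F = np.array([[1.0, 0.0], [0.0, 0.0]])
U_V = np.array([[1.0], [1.0]])   # complement of ker(F) = span{e2}
U_W = np.array([[1.0], [1.0]])   # complement of im(F) = span{e1}
G = pseudo_inverse_from_complements(F, U_V, U_W)
```

Skew complements were chosen on purpose: the resulting $G$ is a perfectly good pseudo-inverse, yet different from the one orthogonal complements would give.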

To reinterpret the above lemma in a slightly more elegant way, however, we'll bundle the components $U_V$ and $U_W$ together and define the set $\Lambda(f)$ as: $$ \Lambda(f) = \{(U_V,U_W) \,\vert\, \ker(f)\oplus U_V=V \textrm{ and } \textrm{im}(f)\oplus U_W=W \} $$ so that we now have an assignment $$ \Phi: \Lambda(f)\longrightarrow \textrm{Hom}(W,V):(U_V,U_W)\mapsto f^\sharp $$ We can now make the following definition:

Then we call $f^\sharp = \Phi(U_V,U_W)$ the pseudo-inverse to the triple $(U_V,U_W,f)$.

We call $g \in \textrm{Hom}(W,V)$ a pseudo-inverse to $f$ if $g \in \textrm{im}(\Phi)$.

We'll denote the set of pseudo-inverses to $f$ by $\Pi(f)$

1. Can we find properties that characterize the set $\Pi(f)$ of pseudo-inverses to $f$?

2. Is the assignment $\Phi$ one-to-one?

It turns out that both questions can be answered positively. The key insight here is to realize that we can retrieve the components $U_V$ and $U_W$ from a pseudo-inverse $g$ as follows: $U_V = \textrm{im}(g)$ and $U_W = \ker(g)$.
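A quick numerical sanity check of this retrieval, on a small hypothetical example: the map $f(x_1,x_2)=(x_1,0)$ with the skew complements $U_V = U_W = \textrm{span}\{(1,1)\}$, whose pseudo-inverse works out to the matrix $G$ below.

```python
import numpy as np

# Hypothetical: f(x1, x2) = (x1, 0), and g the pseudo-inverse built from
# the skew complements U_V = U_W = span{(1, 1)}
F = np.array([[1.0, 0.0], [0.0, 0.0]])
G = np.array([[1.0, -1.0], [1.0, -1.0]])

# Retrieve U_V = im(g): the columns of G span span{(1, 1)}
assert np.allclose(G[:, 0], [1.0, 1.0])
# Retrieve U_W = ker(g): g kills (1, 1)
assert np.allclose(G @ np.array([1.0, 1.0]), 0)
# And g is indeed a pseudo-inverse of f
assert np.allclose(F @ G @ F, F)
```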

In other words: $$\Pi(f)=\{g\in \textrm{Hom}(W,V)\, \vert\, (f\circ g)_{\textrm{im}(f)}=\textrm{id} \textrm{ and } (g\circ f)_{\textrm{im}(g)}=\textrm{id}\}$$
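Numerically, NumPy's `np.linalg.pinv` produces one particular element of $\Pi(f)$, and the two restricted-identity conditions can be checked on spanning sets of the respective images (the matrix is a hypothetical example):

```python
import numpy as np

# A hypothetical rank-deficient map f: ℝ³ → ℝ³ (third row = first + second)
A = np.array([[1.0, 2.0, 3.0],
              [0.0, 1.0, 1.0],
              [1.0, 3.0, 4.0]])
G = np.linalg.pinv(A)  # one particular pseudo-inverse of A

# (f ∘ g) restricted to im(f) is the identity: the columns of A span im(f),
# so this is equivalent to A G A = A
assert np.allclose(A @ G @ A, A)

# (g ∘ f) restricted to im(g) is the identity: the columns of G span im(g),
# so this is equivalent to G A G = G
assert np.allclose(G @ A @ G, G)
```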

This gives another characterization of the pseudo-inverse, namely as a map in $\textrm{Hom}(W,V)$ that indeed behaves like an inverse to $f$ on the images of the respective functions.

Summarizing the above theorem, we know that a pseudo-inverse corresponds exactly to the data of two complementary subspaces $(U_V,U_W)$. Now, we may wonder if there is a canonical choice of such complements…
In general, of course, the answer is no, but if we assume that $V$ and $W$ carry a little more structure, there is something that can be done…

So, let’s now assume that $V$ and $W$ are finite-dimensional *inner product* spaces (we’ll denote the inner products by $\langle -,-\rangle$ throughout). Then we know that for any subspace $U\subset V$, we always have $U\oplus U^\perp =V$. We can thus define the Moore-Penrose pseudo-inverse of $f$ as the canonical choice $$ f^+ = \Phi\big(\ker(f)^\perp,\, \textrm{im}(f)^\perp\big) $$

There is of course one last question that needs to be answered: how do we pick out the Moore-Penrose inverse from the set of pseudo-inverses $\Pi(f)$?

- $f^+ \in \textrm{Hom}(W,V)$ is the Moore-Penrose pseudo-inverse to $f$
- $(f\circ f^+)_{\textrm{im}(f)}=\textrm{Id}$ and $(f^+\circ f)_{\textrm{im}(f^+)}=\textrm{Id}$, and both $f\circ f^+$ and $f^+\circ f$ are self-adjoint linear maps

The above four properties are what Wikipedia takes as the definition of the Moore-Penrose pseudo-inverse, but it turns out that there is a much more natural way of interpreting them!
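For concreteness, all four properties, together with the orthogonal complements $U_V=\ker(f)^\perp$ and $U_W=\textrm{im}(f)^\perp$ behind them, can be checked numerically; `np.linalg.pinv` computes exactly this Moore-Penrose inverse (the matrix below is a hypothetical example):

```python
import numpy as np

# Hypothetical non-square, rank-deficient matrix standing in for f
A = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0]])   # rank 1
P = np.linalg.pinv(A)        # the Moore-Penrose pseudo-inverse f⁺

# The four Penrose conditions:
assert np.allclose(A @ P @ A, A)      # f ∘ f⁺ is the identity on im(f)
assert np.allclose(P @ A @ P, P)      # f⁺ ∘ f is the identity on im(f⁺)
assert np.allclose((A @ P).T, A @ P)  # f ∘ f⁺ is self-adjoint
assert np.allclose((P @ A).T, P @ A)  # f⁺ ∘ f is self-adjoint

# The canonical complements behind this choice:
# U_V = ker(f)⊥: im(f⁺) is orthogonal to ker(A) = span{(2, -1)}
assert np.allclose(np.array([2.0, -1.0]) @ P, 0)
# U_W = im(f)⊥: f⁺ kills vectors orthogonal to im(A), e.g. (2, -1, 0)
assert np.allclose(P @ np.array([2.0, -1.0, 0.0]), 0)
```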