Machine Learning

Formulas

Linear Regression

Hypothesis Function


(1)
\begin{align} h_\theta(x)=\theta^T x=\theta_0 x_0 + \theta_1 x_1 + \cdots + \theta_n x_n \end{align}
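A minimal vectorized sketch (assuming NumPy, a design matrix X whose first column is the bias feature $x_0 = 1$, and a parameter vector theta; all names are illustrative):

import numpy as np

def hypothesis(theta, X):
    # h_theta(x) = theta^T x for every row of X at once
    return X @ theta

# example: 3 training examples, bias column x_0 = 1 plus 2 features
X = np.array([[1.0, 2.0, 3.0],
              [1.0, 4.0, 5.0],
              [1.0, 6.0, 7.0]])
theta = np.array([0.5, 1.0, -1.0])
print(hypothesis(theta, X))   # one prediction per training example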

Cost Function


(2)
\begin{align} J(\theta_0, \theta_1, \ldots, \theta_n) = \frac{1}{2m} \displaystyle\sum\limits_{i=1}^m \left(h_\theta (x^{(i)}) - y^{(i)}\right)^2 \end{align}
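A matching sketch of the cost, under the same assumptions as above:

import numpy as np

def cost(theta, X, y):
    # J(theta) = 1/(2m) * sum((h_theta(x^(i)) - y^(i))^2)
    m = len(y)
    errors = X @ theta - y
    return (errors @ errors) / (2 * m)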

Normal equation

(3)
\begin{align} \theta=(X^TX)^{-1} X^T y \end{align}
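In code this reduces to a single linear-algebra expression; using pinv rather than inv, so the sketch also behaves when $X^TX$ is singular, is a choice of the sketch rather than part of the formula:

import numpy as np

def normal_equation(X, y):
    # theta = (X^T X)^(-1) X^T y
    return np.linalg.pinv(X.T @ X) @ X.T @ y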

Logistic regression

Hypothesis function


(4)
\begin{align} h_\theta(x)=\frac{1}{1+e^{-\theta^T x}} \end{align}
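A minimal sketch (NumPy assumed; names illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, X):
    # h_theta(x) = g(theta^T x), applied to every row of X
    return sigmoid(X @ theta)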

Cost Function


(5)
\begin{align} Cost(h_\theta(x),y)= \begin{cases} -\log (h_\theta(x))& \text{if } y = 1 \\ -\log (1- h_\theta(x))& \text{if } y = 0 \end{cases} \end{align}

That is to say…


(6)
\begin{align} J(\theta)= - \frac{1}{m} \left[\displaystyle\sum\limits_{i=1}^m y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log (1 - h_\theta (x^{(i)})) \right] + \frac{\lambda}{2m}\displaystyle\sum\limits_{j=1}^n \theta_j^2 \end{align}
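A sketch of (6), assuming the sigmoid hypothesis above, a design matrix X whose first column is the bias term, and a regularization strength lam; $\theta_0$ is excluded from the regularization sum:

import numpy as np

def logistic_cost(theta, X, y, lam):
    m = len(y)
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))      # sigmoid hypothesis
    # unregularized cross-entropy term
    J = -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
    # regularization term skips theta_0
    J += lam / (2 * m) * np.sum(theta[1:] ** 2)
    return J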

Implementation notes

  • It is easy to forget that the cost portion of this equation is computed over the entire $X$ and $\theta$ at once; only the regularization term treats $\theta_0$ as a special case, so the first element of the gradient is computed without the regularization term and the remaining elements with it (see the sketch below).
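A matching sketch of the gradient under the same assumptions as the cost sketch above:

import numpy as np

def logistic_gradient(theta, X, y, lam):
    m = len(y)
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    # the error term is computed over all of X and theta at once
    grad = X.T @ (h - y) / m
    # only the regularization treats theta_0 as a special case
    grad[1:] += (lam / m) * theta[1:]
    return grad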

Neural Networks

Logistic sigmoid activation function


(7)
\begin{align} g(z)=\frac{1}{1+e^{-z}} \end{align}

Hypothesis calculation


(8)
\begin{align} a^{(1)} = x \\ z^{(2)} = \Theta ^{(1)}a^{(1)} \\ a^{(2)} = g(z^{(2)}) \\ Add~a_0^{(2)} = 1 \\ z^{(3)} = \Theta^{(2)}a^{(2)}\\ h_\Theta(x) = a^{(3)} = g(z^{(3)}) \end{align}
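A sketch of this forward pass for a three-layer network (NumPy assumed; Theta1 and Theta2 are weight matrices whose first column multiplies the bias unit; names are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Theta1, Theta2):
    a1 = np.concatenate(([1.0], x))             # a^(1) = x with a_0^(1) = 1
    z2 = Theta1 @ a1
    a2 = np.concatenate(([1.0], sigmoid(z2)))   # add a_0^(2) = 1
    z3 = Theta2 @ a2
    a3 = sigmoid(z3)                            # h_Theta(x)
    return a1, a2, a3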

Cost function

The cost function for neural networks is a generalized version of the logistic regression cost function (6).

(9)
\begin{gather} J(\Theta)= - \frac{1}{m} \left[\displaystyle\sum\limits_{i=1}^m \displaystyle\sum\limits_{k=1}^K y_k^{(i)} \log (h_\Theta (x^{(i)})_k) + (1 - y_k^{(i)}) \log (1 - (h_\Theta (x^{(i)}))_k) \right] + \frac{\lambda}{2m} \displaystyle\sum\limits_{l=1}^{L-1} \displaystyle\sum\limits_{i=1}^{s_l} \displaystyle\sum\limits_{j=1}^{s_{l+1}} (\Theta_{ji}^{(l)})^2\\ where~m=number~of~training~examples,\\ L = total~number~of~layers,\\ s_l=number~of~units~in~layer~l,\\ K=number~of~output~units\\ \end{gather}
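A sketch of (9), assuming an m x K matrix H of network outputs (one row per example, e.g. collected from the forward pass above), a one-hot m x K label matrix Y, and a list Thetas of weight matrices whose first column multiplies the bias unit:

import numpy as np

def nn_cost(H, Y, Thetas, lam):
    m = Y.shape[0]
    # cross-entropy summed over all m examples and all K output units
    J = -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / m
    # regularize every weight except the bias column of each Theta
    J += lam / (2 * m) * sum(np.sum(T[:, 1:] ** 2) for T in Thetas)
    return J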

Back propagation

Given that the network has 4 layers (L = 4):

(10)
\begin{gather} \delta_j^{(4)}=a_j^{(4)} - y_j \\ \delta^{(3)}=(\Theta^{(3)})^T\delta^{(4)} ~.*~ g'(z^{(3)})\\ \delta^{(2)}=(\Theta^{(2)})^T\delta^{(3)} ~.* ~g'(z^{(2)})\\ where~the~derivative~term~g'(z^{(l)})=a^{(l)}~.*~(1-a^{(l)})\\ \end{gather}
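A sketch of these delta computations for a single example in a four-layer network, assuming activation vectors a2 and a3 that include their bias units, the output a4, and weight matrices Theta2 and Theta3; dropping the bias entry from each propagated delta is one common implementation convention, not part of equation (10):

import numpy as np

def backprop_deltas(a2, a3, a4, y, Theta2, Theta3):
    d4 = a4 - y                                  # delta^(4) = a^(4) - y
    # delta^(3) = (Theta^(3))^T delta^(4) .* g'(z^(3)),  with g'(z) = a .* (1 - a)
    d3 = (Theta3.T @ d4) * (a3 * (1 - a3))
    d3 = d3[1:]                                  # drop the bias entry
    # delta^(2) = (Theta^(2))^T delta^(3) .* g'(z^(2))
    d2 = (Theta2.T @ d3) * (a2 * (1 - a2))
    d2 = d2[1:]
    return d2, d3, d4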

Support Vector Machines

Kernel similarity function for Gaussian kernel

(11)
\begin{align} \exp\left(-\frac{\|x-l^{(i)}\|^2}{2\sigma^2}\right) \end{align}
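A sketch of the similarity between an example x and a landmark l (NumPy assumed; names illustrative):

import numpy as np

def gaussian_kernel(x, l, sigma):
    # similarity between example x and landmark l
    return np.exp(-np.sum((x - l) ** 2) / (2 * sigma ** 2))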

Principal Component Analysis

Average Square Projection Error

(12)
\begin{align} \frac{1}{m}\displaystyle\sum\limits_{i=1}^{m}\| x^{(i)} - x^{(i)}_{approx}\|^2 \end{align}

Total variation in the data

(13)
\begin{align} \frac{1}{m}\displaystyle\sum\limits_{i=1}^{m}\| x^{(i)}\|^2 \end{align}

Process

(14)
\begin{align} \Sigma=\frac{1}{m}\displaystyle\sum\limits_{i=1}^{m}(x^{(i)})(x^{(i)})^T \end{align}

Compute the eigenvectors, e.g. $[U, S, V]=svd(\Sigma)~~OR~~U=eig(\Sigma)$

(15)
\begin{align} U_{reduce} = \begin{bmatrix} | & | & | & & | \\ u_1 & u_2 & u_3 &\cdots & u_k \\ | & | & | & & | \\ \end{bmatrix} \end{align}

Where $k$ is chosen so that the following equation is true:

(16)
\begin{align} \frac{\frac{1}{m}\displaystyle\sum\limits_{i=1}^{m}\| x^{(i)} - x^{(i)}_{approx}\|^2} {\frac{1}{m}\displaystyle\sum\limits_{i=1}^{m}\| x^{(i)}\|^2} \leq .01 \end{align}

This threshold means that 99% of the variance is retained; 95% is also acceptable in certain applications.
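A sketch of the whole process in code (NumPy assumed; X is an m x n matrix that has already been mean-normalized, which the formulas above take for granted; checking retained variance via the singular values is the standard equivalent form of (16), and the function name is illustrative):

import numpy as np

def pca_reduce(X, variance_to_retain=0.99):
    m = X.shape[0]
    Sigma = (X.T @ X) / m                  # covariance matrix, (14)
    U, S, _ = np.linalg.svd(Sigma)         # columns of U are the eigenvectors
    # smallest k that retains the requested variance (equivalent check to (16))
    retained = np.cumsum(S) / np.sum(S)
    k = int(np.searchsorted(retained, variance_to_retain) + 1)
    U_reduce = U[:, :k]                    # (15)
    Z = X @ U_reduce                       # projected data
    return Z, U_reduce, k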

Anomaly Detection

(17)
\begin{align} \mu_j=\frac{1}{m}\displaystyle\sum\limits_{i=1}^{m} x^{(i)}_j \end{align}
(18)
\begin{align} \sigma^2_j=\frac{1}{m}\displaystyle\sum\limits_{i=1}^{m} (x^{(i)}_j-\mu _j)^2 \end{align}
(19)
\begin{align} p(x) = \prod _{j=1}^n p(x_j; \mu _j, \sigma _j^2)= \prod _{j=1}^n \frac{1}{\sqrt{2\pi}\sigma_j}\exp\left( -\frac{(x_j - \mu_j)^2}{2\sigma _j ^2} \right) \end{align}

It is considered an anomaly if $p(x) < \varepsilon$
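A sketch of the full procedure (17)-(19), assuming X is an m x n matrix with one training example per row; all names are illustrative:

import numpy as np

def fit_gaussian(X):
    mu = X.mean(axis=0)                         # (17)
    sigma2 = ((X - mu) ** 2).mean(axis=0)       # (18)
    return mu, sigma2

def p(x, mu, sigma2):
    # product of per-feature univariate Gaussians, (19)
    coeff = 1.0 / np.sqrt(2 * np.pi * sigma2)
    return np.prod(coeff * np.exp(-(x - mu) ** 2 / (2 * sigma2)))

def is_anomaly(x, mu, sigma2, epsilon):
    return p(x, mu, sigma2) < epsilon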


Notes

Linear Regression

Logistic Regression

For logistic regression, applying the squared-error cost function from linear regression to the sigmoid hypothesis produces a non-convex $J(\theta)$, which prevents gradient descent from reliably converging to a global minimum; the log cost in (5) is used instead because it is convex.

Neural Networks

If a network has $s_j$ units in layer $j$ and $s_{j+1}$ units in layer $j + 1$, then $\Theta^{(j)}$ will be of dimension $s_{j+1} \times (s_j + 1)$.
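For example, with $s_j = 2$ and $s_{j+1} = 4$ the weight matrix is $4 \times 3$; a throwaway check in NumPy:

import numpy as np

s_j, s_j_next = 2, 4
Theta = np.random.randn(s_j_next, s_j + 1)   # the extra column multiplies the bias unit
print(Theta.shape)                           # (4, 3)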
