Jekyll2020-02-23T00:32:47+00:00http://www.michaelpiseno.com/fractal//fractal/feed.xmlFractalFractal is an organization that provides tutorials for topics related to artificial intelligence. We are also involved with educational outreach related to AI such as giving talks at local schools.Linear Approximation2020-02-21T08:06:43+00:002020-02-21T08:06:43+00:00http://www.michaelpiseno.com/fractal//fractal/mfml/2020/02/21/linear-approximation<p>$\newcommand{\norm}[1]{\left\lVert#1\right\rVert}$</p>
<h3 id="linear-approximation">Linear Approximation</h3>
<p>Linear approximation is a fundamental problem in machine learning, and one that has a surprising amount of mathematical structure built around it for such a seemingly simple problem. Consider the following problem: We have a Hilbert space $\mathbf{S}$ and a subspace $\mathbf{T} \subseteq \mathbf{S}$. We also have an element $\mathbf{x} \in \mathbf{S}$. What is the closest element $\mathbf{\hat{x}} \in \mathbf{T}$ to $\mathbf{x}$?</p>
<center>
<div class="col-lg-6 col-md-6 col-sm-12 col-xs-12">
<img src="/fractal/assets/Linear_Approx/linear-approx-problem.png" />
</div>
</center>
<p>This Hilbert space $\mathbf{S}$ has an inner product $\langle\cdot, \cdot\rangle$ and induced norm $\norm{\cdot}$. So we can frame the problem as finding the point $\mathbf{\hat{x}} \in \mathbf{T}$ such that $\norm{\mathbf{\hat{x} - x}}$ is minimized.</p>
<script type="math/tex; mode=display">\begin{equation} \tag{1}
\text{minimize}_{\mathbf{y\in T}} \norm{\mathbf{y - x}}
\end{equation}</script>
<p>We can find a unique minimizer by exploiting orthogonality. In fact, $\mathbf{\hat{x} \in T}$ is the closest point to $\mathbf{x \in S}$ if the error $\mathbf{\hat{x} - x}$ is orthogonal to every point $\mathbf{y \in T}$. This means that $\langle \mathbf{\hat{x} - x}, \mathbf{y}\rangle = 0$ for all $\mathbf{y \in T}$.</p>
<p>Let’s show that if $\langle\mathbf{\hat{x} - x}, \mathbf{y}\rangle = 0$ for all $\mathbf{y \in T}$, then $\mathbf{\hat{x}}$ is the unique minimizer of $(1)$. For any $\mathbf{y \in T}$,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\norm{\mathbf{x - y}}^{2} &= \norm{(\mathbf{x - \hat{x}}) - (\mathbf{y} - \mathbf{\hat{x}})}^{2} \\
&= \norm{\mathbf{x - \hat{x}}}^{2} + \norm{\mathbf{y} - \mathbf{\hat{x}}}^{2}
\end{align*} %]]></script>
<p>The last equality follows from the Pythagorean theorem. This is valid because we required that $\mathbf{x - \hat{x}}$ be orthogonal to all points in $\mathbf{T}$, and $\mathbf{y} - \mathbf{\hat{x}}$ is certainly in $\mathbf{T}$, since $\mathbf{T}$ is a subspace containing both $\mathbf{y}$ and $\mathbf{\hat{x}}$.</p>
<center>
<div class="col-lg-8 col-md-8 col-sm-12 col-xs-12">
<img src="/fractal/assets/Linear_Approx/closest-point.png" />
</div>
</center>
<p>Therefore, if $\norm{\mathbf{y} - \mathbf{\hat{x}}}^{2} \neq 0$ (i.e. $\mathbf{y} \neq \mathbf{\hat{x}}$), then</p>
<script type="math/tex; mode=display">\norm{\mathbf{x} - \mathbf{y}}^{2} > \norm{\mathbf{x - \hat{x}}}^{2}</script>
<p>so the two distances can be equal only when $\mathbf{y} = \mathbf{\hat{x}}$. This implies that $\mathbf{\hat{x}}$ is the unique minimizer of $(1)$. This is a pretty intuitive result: If $\mathbf{x - y}$ is not orthogonal to $\mathbf{T}$, then there is some other point $\mathbf{\hat{x}}$ that comes closer to $\mathbf{x}$ while still remaining inside $\mathbf{T}$. This can be seen visually in the image above.</p>
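<p>As a quick sanity check of the argument above, here is a minimal numerical sketch in $\mathbb{R}^{3}$. The choice of $\mathbf{T}$ as the xy-plane, the specific vectors, and the helper names <code>dot</code>, <code>sub</code> and <code>sq_norm</code> are assumptions made purely for illustration.</p>

```python
# Numerical check of the orthogonality principle in R^3 (illustrative setup).
# T is taken to be the xy-plane, so the closest point to x is x with z set to 0.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

def sq_norm(u):
    return dot(u, u)

x = [1.0, 2.0, 3.0]
x_hat = [1.0, 2.0, 0.0]  # candidate closest point in T

# The error x - x_hat is orthogonal to a basis of T, hence to all of T.
basis_T = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
error = sub(x, x_hat)
assert all(abs(dot(error, v)) < 1e-12 for v in basis_T)

# For other points y in T, the Pythagorean identity holds and y is farther away.
for y in ([0.0, 0.0, 0.0], [4.0, -1.0, 0.0], [1.0, 2.5, 0.0]):
    lhs = sq_norm(sub(x, y))
    rhs = sq_norm(sub(x, x_hat)) + sq_norm(sub(y, x_hat))
    assert abs(lhs - rhs) < 1e-12
    assert lhs > sq_norm(sub(x, x_hat))
```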
<script async="" src="https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>
<p><ins class="adsbygoogle" style="display:block; text-align:center;" data-ad-layout="in-article" data-ad-format="fluid" data-ad-client="ca-pub-8495937332177101" data-ad-slot="5147890219"></ins>
<script>
(adsbygoogle = window.adsbygoogle || []).push({});
</script></p>
<h4 id="computing-the-closest-point">Computing the closest point</h4>
<p>So we know that $\mathbf{\hat{x}}$ is the unique minimizer of $(1)$ if $\langle\mathbf{x - \hat{x}}, \mathbf{y}\rangle = 0$ for all $\mathbf{y}$ in $\mathbf{T}$, but how do we actually compute $\mathbf{\hat{x}}$? If $\mathbf{T}$ is an $N$-dimensional subspace, we can represent any point in it as a linear combination of $N$ basis vectors - call them $\mathbf{v_{1}}, \mathbf{v_{2}}, \ldots, \mathbf{v_{N}}$. In particular, since $\mathbf{\hat{x}}$ lies in $\mathbf{T}$,</p>
<script type="math/tex; mode=display">\mathbf{\hat{x}} = \alpha_{1}\mathbf{v_{1}} + \alpha_{2}\mathbf{v_{2}} + ... + \alpha_{N}\mathbf{v_{N}} = \sum_{n=1}^{N}\alpha_{n}\mathbf{v_{n}}</script>
<p>for some constants $\{\alpha_{n}\}_{n=1}^{N}$. Since every basis vector lies in $\mathbf{T}$, orthogonality also tells us that for each $k = 1, \ldots, N$</p>
<script type="math/tex; mode=display">\langle\mathbf{x - \hat{x}}, \mathbf{v_{k}}\rangle = 0</script>
<p>If we take the inner product of $\mathbf{x - \hat{x}}$ with one of the basis vectors we generate a linear equation.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\langle\mathbf{x - \hat{x}}, \mathbf{v_{k}}\rangle &= \big\langle\mathbf{x} - \sum_{n=1}^{N}\alpha_{n}\mathbf{v_{n}}, \mathbf{v_{k}}\big\rangle \\
&= \langle\mathbf{x}, \mathbf{v_{k}}\rangle - \alpha_{1}\langle\mathbf{v_{1}}, \mathbf{v_{k}}\rangle - ... - \alpha_{N}\langle\mathbf{v_{N}}, \mathbf{v_{k}}\rangle \\
\Rightarrow \langle\mathbf{x}, \mathbf{v_{k}}\rangle &= \alpha_{1}\langle\mathbf{v_{1}}, \mathbf{v_{k}}\rangle + ... + \alpha_{N}\langle\mathbf{v_{N}}, \mathbf{v_{k}}\rangle
\end{align*} %]]></script>
<p>In fact, we can generate $N$ different linear equations by taking the inner product with each of the basis vectors separately. That means we can solve this linear system of equations for $\mathbf{\alpha}$, the vector of coefficients!</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\begin{bmatrix}\langle\mathbf{x}, \mathbf{v_{1}}\rangle \\ \vdots \\ \langle\mathbf{x}, \mathbf{v_{N}}\rangle\end{bmatrix} &=
\begin{bmatrix}
\langle\mathbf{v_{1}}, \mathbf{v_{1}}\rangle & ... & \langle\mathbf{v_{N}}, \mathbf{v_{1}}\rangle \\
\vdots & \ddots & \vdots \\
\langle\mathbf{v_{1}}, \mathbf{v_{N}}\rangle & ... & \langle\mathbf{v_{N}}, \mathbf{v_{N}}\rangle
\end{bmatrix}\begin{bmatrix}\alpha_{1} \\ \vdots \\ \alpha_{N}\end{bmatrix} \\
\\
\mathbf{b} &= \mathbf{G\alpha} \\
\Rightarrow \mathbf{\alpha} &= \mathbf{G^{-1}b}
\end{align*} %]]></script>
<p>where $\mathbf{G}$ is the matrix of inner products, called the Gram matrix or Gramian of the basis $\{\mathbf{v}_{n}\}_{n=1}^{N}$, and $\mathbf{b}$ is the vector of inner products of $\mathbf{x}$ with each basis vector. After we solve for our coefficients, we can easily reconstruct the closest point in $\mathbf{T}$ to $\mathbf{x}$ by</p>
<script type="math/tex; mode=display">\mathbf{\hat{x}} = \alpha_{1}\mathbf{v_{1}} + ... + \alpha_{N}\mathbf{v_{N}}</script>
<p>Take a second to appreciate what we did. We took a minimization problem and converted it to a finite-dimensional linear algebra problem by using our basis to ask the question “what basis coefficients will create a $\mathbf{\hat{x}}$ that minimizes the objective?”. This idea is central to many more topics we will cover.</p>
<p>$\mathbf{G}$ is invertible because the basis vectors are linearly independent. Also, since the (real) inner product is a symmetric function, the Gram matrix is also symmetric. Because the Gram matrix is square and invertible, $\mathbf{b} = \mathbf{G\alpha}$ always has exactly one solution. Further, if we have an orthogonal basis, the Gram matrix is diagonal, and if the basis is orthonormal, the Gram matrix is exactly the identity, so the coefficients can be calculated by simply taking inner products of $\mathbf{x}$ with each basis vector.</p>
<h5 id="example">Example</h5>
<p>We will close with an example to drive this idea home. Let our Hilbert space $\mathbf{S} = \mathbb{R}^{3}$ with the standard inner product and</p>
<script type="math/tex; mode=display">\mathbf{T} = \text{Span}\Bigg(\begin{bmatrix}1 \\ 0 \\ 1\end{bmatrix}, \begin{bmatrix}-1 \\ 0 \\ 1\end{bmatrix}\Bigg), \mathbf{x} = \begin{bmatrix}2 \\ 1 \\ 0\end{bmatrix}</script>
<p>The two vectors that span $\mathbf{T}$ are linearly independent, so they form a basis for the subspace; call them $\mathbf{v_{1}}$ and $\mathbf{v_{2}}$. What is the closest point in $\mathbf{T}$ to $\mathbf{x}$? We can write $\mathbf{\hat{x}}$ as</p>
<script type="math/tex; mode=display">\mathbf{\hat{x}} = \alpha_{1}\mathbf{v_{1}} + \alpha_{2}\mathbf{v_{2}}</script>
<p>and our Gram Matrix and $\mathbf{b}$ are</p>
<script type="math/tex; mode=display">% <![CDATA[
\mathbf{G} = \begin{bmatrix}
2 & 0 \\
0 & 2
\end{bmatrix}, \mathbf{b} = \begin{bmatrix}2 \\ -2\end{bmatrix} %]]></script>
<p>The inverse Gram Matrix is</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{bmatrix}
\frac{1}{2} & 0 \\
0 & \frac{1}{2}
\end{bmatrix} %]]></script>
<p>Finally, $\mathbf{\alpha} = \mathbf{G^{-1}b} = \begin{bmatrix}1 & -1\end{bmatrix}^{T}$. We reconstruct our solution using the coefficients: $\mathbf{\hat{x}} = \mathbf{v_{1}} - \mathbf{v_{2}} = \begin{bmatrix}2 & 0 & 0\end{bmatrix}^{T}$.</p>
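<p>The same example can be reproduced in a few lines of plain Python. This is only a sketch: <code>dot</code> and <code>solve_2x2</code> are hypothetical helpers written for this two-dimensional subspace (a general implementation would call a linear solver instead of Cramer’s rule).</p>

```python
# Sketch: closest point in T = span(v1, v2) to x via the Gram system G alpha = b.
# solve_2x2 is a hypothetical helper (Cramer's rule), enough for a 2-D subspace.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def solve_2x2(G, b):
    det = G[0][0] * G[1][1] - G[0][1] * G[1][0]
    if det == 0:
        raise ValueError("Gram matrix is singular: basis vectors are dependent")
    a1 = (b[0] * G[1][1] - G[0][1] * b[1]) / det
    a2 = (G[0][0] * b[1] - G[1][0] * b[0]) / det
    return [a1, a2]

v1 = [1.0, 0.0, 1.0]
v2 = [-1.0, 0.0, 1.0]
x = [2.0, 1.0, 0.0]
basis = [v1, v2]

# Row k of G holds <v_n, v_k> for each n; entry k of b holds <x, v_k>.
G = [[dot(vn, vk) for vn in basis] for vk in basis]
b = [dot(x, vk) for vk in basis]

alpha = solve_2x2(G, b)
x_hat = [alpha[0] * p + alpha[1] * q for p, q in zip(v1, v2)]

# The residual x - x_hat should be orthogonal to both basis vectors.
residual = [xi - xh for xi, xh in zip(x, x_hat)]
```

<p>Here the residual is $\begin{bmatrix}0 & 1 & 0\end{bmatrix}^{T}$, which has zero inner product with both basis vectors, exactly as the orthogonality condition requires.</p>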
<script async="" src="https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>
<p><ins class="adsbygoogle" style="display:block; text-align:center;" data-ad-layout="in-article" data-ad-format="fluid" data-ad-client="ca-pub-8495937332177101" data-ad-slot="5147890219"></ins>
<script>
(adsbygoogle = window.adsbygoogle || []).push({});
</script></p>$\newcommand{\norm}[1]{\left\lVert#1\right\rVert}$Policy Gradient and Actor Critic2020-02-10T08:06:43+00:002020-02-10T08:06:43+00:00http://www.michaelpiseno.com/fractal//fractal/dl/2020/02/10/policy-gradient-actor-critic<h3 id="policy-gradient">Policy Gradient</h3>
<p>What if we could learn the policy parameters directly? We can approach this problem by thinking of policies abstractly - Let’s consider a class of policies defined by $\theta$ and refer to such a policy as $\pi_{\theta}(a|s)$ which is a probability distribution over the action space conditioned on the state $s$. These parameters $\theta$ could be the parameters of a neural network or a simple polynomial or anything really.</p>
<p>Let’s now define a metric $J$ which can be used to evaluate the quality of a policy $\pi_{\theta}$. What we really want to do is maximize the expected future reward, so naturally we can write</p>
<script type="math/tex; mode=display">J(\pi_{\theta}) = \mathbb{E}\bigg[\sum_{t=1}^{T}R(s_{t}, a_{t})\bigg]</script>
<p>where $R(s_{t}, a_{t})$ is the reward given by taking action $a_{t}$ in state $s_{t}$ at time $t$. The optimal set of parameters for the policy can then be written as</p>
<script type="math/tex; mode=display">\theta^{\ast} = \arg\max_{\theta}\mathbb{E}\bigg[\sum_{t=1}^{T}R(s_{t}, a_{t})\bigg]</script>
<p>Now consider a trajectory $\tau = (s_{1}, a_{1}, s_{2}, a_{2}, \ldots, s_{T})$ which is a sequence of state-action pairs until the terminal state. We are trying to learn the $\theta$ that maximizes the expected reward over trajectories. So in the spirit of gradient descent, we are going to take actions within our environment to sample a trajectory and then use the rewards gained from that trajectory to adjust our parameters. We can write our objective function as</p>
<script type="math/tex; mode=display">J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}(\tau)}[R(\tau)]</script>
<p>where $R(\tau)$ is the cumulative reward gained by our trajectory. Our objective is to take the gradient of this function with respect to $\theta$ so that we can use a gradient-based update rule to adjust our parameters. The reward function is not known and may not even be differentiable, but with a few clever tricks we can still estimate the gradient. Recall that for a continuous random variable $x$ with probability density $p(x)$ and any function $f$, $\mathbb{E}[f(x)] = \int_{-\infty}^{\infty}p(x)f(x)dx$. So we have</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
J(\theta) &= \mathbb{E}_{\tau \sim \pi_{\theta}(\tau)}[R(\tau)] \\
&= \int p(\tau)R(\tau)d\tau \\
&= \int \pi_{\theta}(\tau)R(\tau)d\tau
\end{align*} %]]></script>
<p>and</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\nabla_{\theta}J(\theta) &= \nabla_{\theta} \int \pi_{\theta}(\tau)R(\tau)d\tau \\
&= \int \nabla_{\theta}\pi_{\theta}(\tau)R(\tau)d\tau \\
&= \int \pi_{\theta}(\tau)\nabla_{\theta}\log(\pi_{\theta}(\tau))R(\tau)d\tau \\
&= \mathbb{E}_{\tau \sim \pi_{\theta}(\tau)}[\nabla_{\theta}\log(\pi_{\theta}(\tau))R(\tau)]
\end{align*} %]]></script>
<p>where the third line follows from the fact that $\nabla_{x}f(x) = f(x)\nabla_{x}\log(f(x))$. The fact that we have turned the gradient of our objective $J$ into an expectation is good because that means we can estimate it by sampling data. The last piece of the puzzle is to figure out how to calculate $\nabla_{\theta}\log(\pi_{\theta}(\tau))$. Note that we can rewrite $\pi_{\theta}(\tau)$ as</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\pi_{\theta}(\tau) = \pi_{\theta}(a_{1}, s_{1}, a_{2}, s_{2}, ..., s_{T}) &= p(s_{1}) \prod_{t=1}^{T} p(a_{t}|s_{t})p(s_{t+1}|a_{t}, s_{t}) \\
&= p(s_{1}) \prod_{t=1}^{T} \pi_{\theta}(a_{t}|s_{t})p(s_{t+1}|a_{t}, s_{t})
\end{align*} %]]></script>
<p>Convince yourself that the above relation is true. $\pi_{\theta}(\tau)$ is the probability of trajectory $\tau$ happening. It is the probability of starting in $s_{1}$, then taking action $a_{1}$ given $s_{1}$, then transitioning to state $s_{2}$ given $a_{1}$ in $s_{1}$, and so on. This joint probability can be factored out. The last step is to realize $p(a_{t}|s_{t})$ is the definition of $\pi_{\theta}(a_{t}|s_{t})$. Now</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\nabla_{\theta} \log(\pi_{\theta}(\tau)) &= \nabla_{\theta}\log\bigg[p(s_{1}) \prod_{t=1}^{T} \pi_{\theta}(a_{t}|s_{t})p(s_{t+1}|a_{t}, s_{t})\bigg] \\
&= \nabla_{\theta}\bigg[\log(p(s_{1})) + \sum_{t=1}^{T}\log(\pi_{\theta}(a_{t}|s_{t})) + \sum_{t=1}^{T}\log(p(s_{t+1}|a_{t}, s_{t}))\bigg] \\
&= 0 + \nabla_{\theta}\sum_{t=1}^{T}\log(\pi_{\theta}(a_{t}|s_{t})) + 0
\end{align*} %]]></script>
<p>This simplification is enough for us to complete our estimate of the policy gradient $\nabla_{\theta}J(\theta)$.</p>
<script type="math/tex; mode=display">\nabla_{\theta}J(\theta) \approx \frac{1}{N}\sum_{n=1}^{N}\Bigg[\bigg(\sum_{t=1}^{T} \nabla_{\theta}\log(\pi_{\theta}(a_{n,t}|s_{n,t}))\bigg)\bigg(\sum_{t=1}^{T}r(s_{n,t},a_{n,t})\bigg)\Bigg]</script>
<p>where $N$ is the number of sampled trajectories (episodes). Averaging the policy gradient estimate over a set of $N$ trajectories makes the estimate more robust. Now that we can estimate the policy gradient, we update our parameters in the familiar way, except that we step along the gradient (gradient ascent) because we are maximizing $J$ rather than minimizing a loss</p>
<script type="math/tex; mode=display">\theta \leftarrow \theta + \alpha\nabla_{\theta}J(\theta)</script>
<p>One interpretation of this result is that we are trying to maximize the log likelihood of trajectories that give good rewards and minimize the log likelihood of those that don’t. This is the idea behind the REINFORCE algorithm which is</p>
<ol>
<li>Sample $N$ trajectories by running the policy</li>
<li>Estimate the policy gradient like above</li>
<li>Update the parameters $\theta$</li>
<li>Repeat until converged</li>
</ol>
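<p>The steps above can be sketched in plain Python on a toy problem. Everything specific here is an assumption for illustration and not part of the derivation: a hypothetical environment with a single decision per trajectory (so $T = 1$), two actions where action 1 pays reward 1 and action 0 pays reward 0, and a softmax policy over two logits. The update adds the gradient because the goal is to maximize $J$.</p>

```python
import math
import random

# Minimal REINFORCE sketch on a hypothetical two-action, one-step environment.
# Assumptions (not from the text): softmax policy over two logits, action 1
# pays reward 1, action 0 pays reward 0, and every trajectory has length 1.

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(probs, rng):
    return 1 if rng.random() < probs[1] else 0

def reward(action):
    return 1.0 if action == 1 else 0.0

def grad_log_prob(probs, action):
    # For a softmax policy, d/d theta_i log pi(a) = 1[i == a] - pi(i).
    return [(1.0 if i == action else 0.0) - probs[i] for i in range(len(probs))]

def reinforce(num_iters=300, num_trajectories=10, lr=0.5, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0]
    for _ in range(num_iters):
        probs = softmax(theta)
        grad = [0.0, 0.0]
        # 1. Sample N trajectories by running the current policy.
        for _ in range(num_trajectories):
            a = sample_action(probs, rng)
            g = grad_log_prob(probs, a)
            # 2. Average grad log pi(a) times the trajectory reward.
            for i in range(2):
                grad[i] += g[i] * reward(a) / num_trajectories
        # 3. Gradient ascent step, since J is being maximized.
        theta = [t + lr * gi for t, gi in zip(theta, grad)]
    return softmax(theta)

final_probs = reinforce()
```

<p>In this toy setting only trajectories that chose the rewarding action contribute to the gradient, so the policy shifts probability toward that action over the iterations.</p>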
<script async="" src="https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>
<!-- horizontal -->
<p><ins class="adsbygoogle" style="display:block" data-ad-client="ca-pub-8495937332177101" data-ad-slot="8539861386" data-ad-format="auto" data-full-width-responsive="true"></ins>
<script>
(adsbygoogle = window.adsbygoogle || []).push({});
</script></p>
<h3 id="actor-critic">Actor Critic</h3>
<p>One issue with vanilla policy gradients is that it’s very hard to assign credit to the state-action pairs that resulted in good reward because we only consider the total reward $\sum_{t=1}^{T}R(s_{t}, a_{t})$. The trajectories are noisy. But if we had the $Q$ function, we would know which state-action pairs were good. In other words, we would estimate the gradient of $J$ as</p>
<script type="math/tex; mode=display">\nabla_{\theta}J(\theta) = \mathbb{E}[\nabla_{\theta}\log(\pi_{\theta}(\tau))Q_{\pi_{\theta}}(\tau)]</script>
<p>The idea of actor-critic is that we have an actor that samples trajectories using the policy, and a critic that critiques the policy using the $Q$ function. Since we don’t have the optimal $Q$ function, we can estimate it like we did in deep Q-learning. So we could have a policy network that takes in a state and returns a probability distribution over the action space (i.e. $\pi_{\theta}(a|s)$) and a $Q$ network that takes in a state-action pair and returns its Q-value estimate. Let’s say this network is parameterized by a generic variable $\beta$. Note that these don’t have to be neural networks, but for the sake of this guide I’ll just say “network”. So we have networks $\pi_{\theta}$ and $Q_{\beta}$. The general actor-critic algorithm then goes like</p>
<ol>
<li>Initialize $s, \theta, \beta$</li>
<li>Repeat until converged:
<ul>
<li>Sample action $a$ from $\pi_{\theta}(\cdot|s)$</li>
<li>Receive reward $r$ and sample next state $s' \sim p(s'|s, a)$</li>
<li>Use the critic to evaluate the actor and update the policy, much as we did in policy gradients:
<script type="math/tex">\theta \leftarrow \theta + \alpha\nabla_{\theta}\log(\pi_{\theta}(a|s))Q_{\beta}(s, a)</script></li>
<li>Update the critic according to some loss metric: $\text{MSE Loss} = (Q_{t+1}(s, a) - (r + \gamma\max_{a'}Q_{t}(s', a')))^{2}$</li>
<li>Update $\beta$ using backprop or another update rule</li>
</ul>
</li>
</ol>
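<p>A minimal sketch of this loop is shown below. The setup is hypothetical and deliberately tiny: a single state in which every action ends the episode (so the critic’s target reduces to $r$), a tabular critic instead of a $Q$ network, and a two-logit softmax actor. The actor step adds the gradient term because the objective is being maximized.</p>

```python
import math
import random

# One-step actor-critic sketch on a hypothetical single-state task where every
# action ends the episode, so the critic target r + gamma * max Q(s', .) is r.
# Assumptions (not from the text): tabular critic, two-logit softmax actor,
# action 1 pays reward 1 and action 0 pays reward 0.

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def actor_critic(num_steps=2000, actor_lr=0.1, critic_lr=0.25, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0]  # actor parameters
    q = [0.0, 0.0]      # critic estimates Q(s, a) for the single state
    for _ in range(num_steps):
        probs = softmax(theta)
        a = 1 if rng.random() < probs[1] else 0  # sample a ~ pi(.|s)
        r = 1.0 if a == 1 else 0.0               # observe the reward
        # Actor: step along grad log pi(a|s), scaled by the critic's Q(s, a).
        for i in range(2):
            grad_log = (1.0 if i == a else 0.0) - probs[i]
            theta[i] += actor_lr * grad_log * q[a]
        # Critic: gradient step on the squared error (Q(s, a) - target)^2.
        target = r  # the next state is terminal, so there is no bootstrap term
        q[a] -= critic_lr * 2.0 * (q[a] - target)
    return softmax(theta), q

policy, q_values = actor_critic()
```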
<p>Of course you can sample whole trajectories instead of one state-action pair at a time. Different types of actor-critic result from changing the “critic”. In REINFORCE, the critic was simply the reward we got from the trajectory. In actor-critic, the critic is the Q function. Another popular choice is called advantage actor-critic, in which the critic is the advantage function</p>
<script type="math/tex; mode=display">A_{\pi_{\theta}}(s, a) = Q_{\pi_{\theta}}(s, a) - V_{\pi_{\theta}}(s)</script>
<p>where $V$ is the value function (recall value iteration). The advantage function $A$ tells us how much better taking action $a$ in state $s$ is than the expected cumulative reward of being in state $s$.</p>
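<p>As a small worked example, the advantage for one state can be computed from a table of Q-values and the policy’s action probabilities, using $V_{\pi}(s) = \sum_{a}\pi(a|s)Q_{\pi}(s, a)$. The numbers and the function name <code>advantage</code> below are made up for illustration.</p>

```python
def advantage(q_values, policy_probs):
    """Return A(s, a) = Q(s, a) - V(s) for every action in one state.

    q_values[a] is Q(s, a) and policy_probs[a] is pi(a|s); V(s) is the
    policy-weighted average of the Q-values.
    """
    v = sum(p * q for p, q in zip(policy_probs, q_values))
    return [q - v for q in q_values]

# Hypothetical state with two actions chosen with equal probability.
adv = advantage([1.0, 3.0], [0.5, 0.5])
```

<p>Here $V(s) = 2$, so the first action has advantage $-1$ and the second has advantage $+1$: positive values mark actions that do better than the policy’s average behavior in that state.</p>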
<p>This concludes our discussion of RL for the Deep Learning section. In the future I will make more RL-related guides that focus on more advanced topics and current research. Feel free to reach out with any questions or if you notice something you think is inaccurate and I’ll do my best to respond!</p>
<script async="" src="https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>
<!-- horizontal -->
<p><ins class="adsbygoogle" style="display:block" data-ad-client="ca-pub-8495937332177101" data-ad-slot="8539861386" data-ad-format="auto" data-full-width-responsive="true"></ins>
<script>
(adsbygoogle = window.adsbygoogle || []).push({});
</script></p>Policy GradientDeep Q-Learning2020-02-05T08:06:43+00:002020-02-05T08:06:43+00:00http://www.michaelpiseno.com/fractal//fractal/dl/2020/02/05/deep-q-learning<h3 id="learning-based-methods">Learning-Based Methods</h3>
<p>Policy and Value Iteration gave us a solid way to find the optimal policy when we have perfect information about the environment (i.e. we know the distributions of state transitions and rewards), but when this information is not known, we have to get clever with how we determine good policies. One way is to learn by trial and error - taking actions in the environment and observing what states we transition to under different actions and what rewards we obtain for doing so. Doing this gives us data in the form $(s, a, r, s')$: if we take action $a$ in state $s$, we receive reward $r$ and transition to state $s'$. From this data we can try to approximate the unknown distributions.</p>
<p>Another issue we face is large state spaces. Policy and Value iteration worked fine for Gridworld (small state space), but when the total number of states becomes large, these algorithms become intractable - they contain a $|\mathbf{S}|^{3}$ and a $|\mathbf{S}|^{2}$ term respectively in their time complexities! Our solution to this issue is to learn a lower-dimensional representation of the state using neural networks. This is known as deep reinforcement learning, and the type we will be exploring in this guide is called deep Q-Learning.</p>
<h3 id="deep-q-learning">Deep Q-Learning</h3>
<p>Essentially what we are trying to do is approximate $Q^{\ast}(s, a)$ using a neural network. If we can get a good approximation of $Q^{\ast}$, we can extract a good policy. This neural network will be parameterized by a generic term $\theta$, will take the state $s$ as input, and will output a value for each possible action. Taking a max over these outputs gives the best action to take.</p>
<p>In order to learn such a function, we need to define a loss function so that our network knows what it’s optimizing for. Recall that the optimal Q function satisfies</p>
<script type="math/tex; mode=display">Q^{\ast}(s,a) = \mathbb{E}_{s' \sim p(s'|s, a)}\bigg[r(s, a) + \gamma \max_{a'}Q^{\ast}(s', a')\bigg]</script>
<p>Assume we have a bunch of data in the form $\{(s, a, r, s')_{i}\}_{i=1}^{N}$. Then for one of the data points, we can measure how closely our Q network approximates the optimal Q function with the following loss.</p>
<script type="math/tex; mode=display">\text{MSE Loss} = (Q_{t}(s, a) - [r + \gamma\max_{a'} Q_{t-1}(s', a')])^{2}</script>
<p>where $Q_{t}$ and $Q_{t-1}$ represent the network output after and before a single weight update, respectively. Notice how the first term in the square is our network’s current output and the second term is the target Q-value that we want, computed from the old network weights. The training pipeline looks something like the figure below. First we collect a batch of data of size $B$ (i.e. the agent takes actions in the environment), then we feed that data into the network, compute the loss, and update our network weights. Below, $D$ is the dimensionality of the state representation (e.g. the number of pixels in an image).</p>
<center>
<div class="col-lg-10 col-md-10 col-sm-12 col-xs-12">
<img src="/fractal/assets/Deep_Q_Learning/q-network-training.png" />
</div>
</center>
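<p>To make the loss concrete, here is a small sketch that evaluates it on a batch of transitions. The “networks” are stand-ins: <code>q_current</code> and <code>q_old</code> are hypothetical callables that map a state to a list of action values (here backed by a lookup table), and the target includes the discount factor $\gamma$ from the Bellman equation above.</p>

```python
def td_target(r, next_q_values, gamma):
    """Target r + gamma * max over a' of Q_old(s', a') for one transition."""
    return r + gamma * max(next_q_values)

def batch_mse_loss(batch, q_current, q_old, gamma=0.9):
    """Mean squared TD error over a batch of (s, a, r, s_next) tuples."""
    total = 0.0
    for s, a, r, s_next in batch:
        prediction = q_current(s)[a]                 # current network output
        target = td_target(r, q_old(s_next), gamma)  # uses the old weights
        total += (prediction - target) ** 2
    return total / len(batch)

# Toy tabular stand-in for both networks: state -> list of action values.
q_table = {0: [0.0, 1.0], 1: [2.0, 0.5]}

def q_fn(state):
    return q_table[state]

batch = [(0, 1, 1.0, 1), (1, 0, 0.0, 0)]
loss = batch_mse_loss(batch, q_fn, q_fn, gamma=0.5)
```

<p>For the first transition the prediction is 1.0 and the target is $1.0 + 0.5 \cdot 2.0 = 2.0$; for the second the prediction is 2.0 and the target is $0.0 + 0.5 \cdot 1.0 = 0.5$, so the batch loss is $(1.0 + 2.25)/2 = 1.625$.</p>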
<h3 id="epsilon-greedy-and-experience-replay">Epsilon Greedy and Experience Replay</h3>
<p>This framework gives us a good way to approximate the optimal Q function, but there still remains the question of how we actually collect the data. What policy should we use for that? To better explain this problem let’s consider an example. Say we have some sub-optimal policy $\pi_{0}$ that we will use to collect experience data $(s, a, r, s')$ in the environment. If we simply choose the best action for each state according to this sub-optimal policy, we may not discover that some actions that are not chosen by $\pi_{0}$ lead to good rewards. Essentially, we will be stuck in a local optimum. One way around this is to occasionally take random actions so that we have a chance of seeing new experiences and hopefully finding better actions to take. This is an exploration strategy known as epsilon-greedy. It says that for some time $t$ the action we choose should be made according to the following rule.</p>
<script type="math/tex; mode=display">% <![CDATA[
a_{t} =
\begin{cases}
\arg\max_{a}Q_{t}(s, a) & \text{with probability } 1 - \epsilon \\
\text{random action} & \text{with probability } \epsilon
\end{cases} %]]></script>
<p>This will allow our agent to do some exploring to find good state-action combinations. Typically, it is good to do a lot of exploration when the network first starts training by using a high value for epsilon, reducing epsilon gradually as training progresses.</p>
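<p>A minimal sketch of this rule, together with one possible way to reduce epsilon over time, might look like the following. The linear schedule and its default values (<code>start</code>, <code>end</code>, <code>decay_steps</code>) are assumptions for illustration; other decay schedules are equally reasonable.</p>

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Greedy choice: index of the largest Q-value.
    return max(range(len(q_values)), key=lambda a: q_values[a])

def linear_epsilon(step, start=1.0, end=0.05, decay_steps=10_000):
    """Hypothetical schedule: decay linearly from start to end, then hold."""
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)

# Usage: explore heavily early in training, act mostly greedily later on.
rng = random.Random(0)
early_action = epsilon_greedy([0.1, 0.9, 0.3], linear_epsilon(0), rng)
late_action = epsilon_greedy([0.1, 0.9, 0.3], 0.0, rng)
```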
<p>The next issue we run into is that consecutive data is highly correlated, which can lead to feedback loops or just really slow training. For example, if we are gathering data under a policy that tells the agent to move down, then data that represents this type of action will be overrepresented in the next iteration of training even though the better option might be to go right and we just haven’t explored that yet. To solve this, one solution is to maintain a buffer that stores data $(s, a, r, s')$ that we continually update while the agent moves through the environment, removing old data as the buffer gets full. When it comes time to sample a batch of data for training, we randomly sample from this buffer rather than take a bunch of consecutive data like before. This approach is called experience replay.</p>
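<p>One simple way to sketch such a buffer is with a fixed-length <code>deque</code>, which drops the oldest transition automatically once it is full. The class name <code>ReplayBuffer</code> and its methods are illustrative choices, not a reference implementation.</p>

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (s, a, r, s_next) transitions."""

    def __init__(self, capacity, seed=None):
        # deque(maxlen=...) discards the oldest item when a new one is
        # appended to a full buffer.
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks up the correlation
        # between consecutive transitions.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

# Usage: with capacity 3, only the three most recent transitions are kept.
buf = ReplayBuffer(capacity=3, seed=0)
for i in range(5):
    buf.add(i, 0, 0.0, i + 1)
batch = buf.sample(2)
```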
<p>Armed with the knowledge of these common problems and some solid ways to address them, we present the full Deep Q-Learning algorithm with Experience Replay.</p>
<center>
<figure>
<div class="col-lg-12 col-md-12 col-sm-12 col-xs-12">
<img src="/fractal/assets/Deep_Q_Learning/DQN-algorithm.png" />
<figcaption>Credit: Fei-Fei Li, Justin Johnson, Serena Yeung: CS231n</figcaption>
</div>
</figure>
</center>
<p>The function $\phi$ is just a preprocessing step before inputting the data into the neural network and can be ignored for our purposes. The curious reader can explore the full paper from DeepMind: <a href="https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf">Playing Atari with Deep Reinforcement Learning</a>.</p>
<p>We have seen a method for approximating the Q function using neural networks by gathering experience data from within the environment and using it to train the network, as well as some problems that arise from this approach. We have also seen reasonable ways to deal with these problems. Next, we will learn methods for estimating the optimal policy without going through the middle-man of estimating a Q function.</p>Learning-Based MethodsReinforcement Learning Background2020-02-02T08:06:43+00:002020-02-02T08:06:43+00:00http://www.michaelpiseno.com/fractal//fractal/dl/2020/02/02/reinforcement-learning<h3 id="reinforcement-learning">Reinforcement Learning</h3>
<p>Reinforcement learning (RL) is different from supervised and unsupervised learning. In supervised learning, we have truth data (labels) for our problem that we use to check the output of our model against, correcting for mistakes accordingly. In unsupervised learning, we are learning some structure in the data. In RL we don’t necessarily have data, but instead we have an environment and a set of rules. There exists an agent that lives in this environment and its objective is to take actions that will eventually lead to reward. Whereas supervised learning tries to match data to its corresponding label, in RL we try to maximize reward. In other words, we are learning how to make the agent take a good sequence of actions.</p>
<center>
<div class="col-lg-6 col-md-6 col-sm-12 col-xs-12">
<img src="/fractal/assets/RL_Intro/rl-schema.png" />
</div>
</center>
<h3 id="framing-an-rl-problem">Framing an RL Problem</h3>
<p>We will frame an RL problem as a Markov Decision Process (MDP), which is a fancy-sounding way of formulating decision making under uncertainty. We will define the following ideas that will guide us in formulating the problem:</p>
<ul>
<li>$\mathbf{S}$: The set of possible states</li>
<li>$\mathbf{A}$: The set of possible actions the agent can take</li>
<li>$R(s, a, s')$: A probability distribution of the reward given for being in state $s$, taking action $a$ and ending up in a new state $s'$</li>
<li>$\mathbb{T}(s, a, s')$: A probability distribution of state transitions</li>
<li>$\gamma \in [0, 1)$: A scalar discount factor (will come in handy later)</li>
</ul>
<p>Some literature will also use $\mathbf{O}$ which is the set of possible observations given to the agent by the environment. This is sometimes the same as $\mathbf{S}$ and sometimes not. In a fully observable MDP, the agent has all information about the state of the environment, so when the agent receives an observation $o_{i} \in \mathbf{O}$, it contains the same information as the state of the environment $s_{i} \in \mathbf{S}$. An example of this is chess - each player (agent) knows exactly what the state of the game is at any time. In a partially observable MDP this is not the case. The agent does not have access to the full state of the environment, so when it receives an observation, it does not contain the same information as the state of the environment, hence these are two different concepts. An example of this is poker - each player does not know the cards of the other players and therefore does not have access to the full state of the game.</p>
<p>The last concept is a policy, which is a function $\pi : \mathbf{S} \rightarrow \mathbf{A}$ that tells us which action to take given a state. The whole idea of RL is to learn a good policy; one that will tell us good actions to take in each state of the environment. A policy can be interpreted deterministically as $\pi(s)$ (the action taken when we are in state $s$), or stochastically as $\pi(a|s)$ (the probability of taking action $a$ in state $s$).</p>
<p>Most of the time in RL, we do not have access to the true distributions $R(s, a, s')$ and $\mathbb{T}(s, a, s')$. If we had these distributions, we could calculate the optimal policy directly. Without this information, we have to estimate them by trying out actions in our environment and seeing whether we get reward or not.</p>
<h3 id="grid-world">Grid World</h3>
<p>For now, we will assume we have access to the distributions $R(s, a, s')$ and $\mathbb{T}(s, a, s')$ so that we can really drive home the point that if we have the true distributions at hand, we can calculate the optimal policy. Imagine we have the following problem.</p>
<ul>
<li>The agent lives in a grid, where each square is a state. This is the state space.</li>
<li>The agent can move North, South, East, or West (N, S, E, W). This is the action space.</li>
<li>80% of the time, the action the agent takes has its intended effect. 10% of the time the agent slips and moves to one side, and 10% of the time the agent slips to the other side. For example if the agent chooses to move north, there is an 80% chance it will do so, a 10% chance it will move west, and a 10% chance it will move east. This is the transition probability distribution.</li>
<li>There is a destination state that deterministically gives the agent a reward of +1 for reaching it and a terminal state that deterministically gives the agent a reward of -1 for reaching it.</li>
</ul>
<center>
<div class="col-lg-8 col-md-8 col-sm-12 col-xs-12">
<img src="/fractal/assets/RL_Intro/gridworld-example.png" />
</div>
</center>
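<p>The transition distribution described above can be written down directly. The sketch below makes two assumptions that the description does not specify: states are <code>(row, column)</code> pairs with north meaning a smaller row index, and a move that would leave the grid keeps the agent where it is.</p>

```python
# Directions as (row_delta, col_delta); north is taken to be "up" (row - 1).
MOVES = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}
# The two sideways directions the agent can slip to for each intended move.
SLIPS = {"N": ("W", "E"), "S": ("E", "W"), "E": ("N", "S"), "W": ("S", "N")}

def transition_distribution(state, action, n_rows, n_cols):
    """Return {next_state: probability} for the noisy grid move."""
    dist = {}
    outcomes = [(action, 0.8), (SLIPS[action][0], 0.1), (SLIPS[action][1], 0.1)]
    for direction, prob in outcomes:
        dr, dc = MOVES[direction]
        r, c = state[0] + dr, state[1] + dc
        # Assumption: bumping into the boundary leaves the agent in place.
        if not (0 <= r < n_rows and 0 <= c < n_cols):
            r, c = state
        dist[(r, c)] = dist.get((r, c), 0.0) + prob
    return dist

# Usage: moving north from an interior cell of a 3 x 4 grid.
interior = transition_distribution((1, 1), "N", 3, 4)
```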
<h3 id="finding-optimal-policies">Finding Optimal Policies</h3>
<p>So now that we have a concrete example of a problem, we can discuss what it means to find an optimal policy for it. Some questions that come up when determining what a “good” policy is are “does it maximize the reward right now?” and “does it maximize the future reward?”. Typically, we maximize the discounted future reward; the idea being that we want policies that take future state into consideration, but we also don’t want the policy to focus so much on optimizing for future rewards that it doesn’t take actions that would put the agent in a good state now. Therefore we define the optimal policy $\pi^{\ast}$ in the following way.</p>
<script type="math/tex; mode=display">\pi^{\ast} = \arg \max_{\pi} \mathbb{E}\bigg[\sum_{t \geq 0} \gamma^{t}r_{t}|\pi\bigg]</script>
<p>Here, time is indexed by $t$. This means we want to maximize the expectation of the discounted reward given some policy. Notice that since $\gamma$ is less than 1, rewards closer in time are weighted more heavily than ones further away.</p>
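<p>For a single sampled sequence of rewards, the discounted sum inside the expectation is easy to compute. The helper below is a small illustrative sketch; the name <code>discounted_return</code> and the example rewards are made up.</p>

```python
def discounted_return(rewards, gamma):
    """Compute sum over t of gamma**t * rewards[t] for one reward sequence."""
    total = 0.0
    # Work backwards so each step is one multiply and one add:
    # G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total

# Three rewards of 1 with gamma = 0.5: 1 + 0.5 + 0.25.
g = discounted_return([1.0, 1.0, 1.0], 0.5)
```

<p>With $\gamma = 0.5$ the three unit rewards contribute $1$, $0.5$ and $0.25$, which shows how later rewards count for less.</p>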
<h4 id="value-function-and-q-function">Value Function and Q-Function</h4>
<p>We have a notion of what a “good” policy is, but how do we actually find it? This is where the Value function and Q function come in. The value function is a prediction of future reward and basically answers the question “how good is the current state $s$ that I’m in?”. We denote $V^{\pi}(s)$ as the expected cumulative reward of being in state $s$ and then following policy $\pi$ thereafter.</p>
<script type="math/tex; mode=display">V^{\pi}(s) = \mathbb{E}\bigg[\sum_{t \geq 0} \gamma^{t}r_{t}|s_{0}=s, \pi\bigg]</script>
<p>We also have the notion of an optimal value function $V^{\ast}(s)$, which is the expected cumulative reward of being in state $s$ and then following the optimal policy $\pi^{\ast}$ thereafter. The Q function represents a similar idea - $Q^{\pi}(s, a)$ is the expected cumulative reward for taking action $a$ in state $s$ and then following policy $\pi$ thereafter. Similarly $Q^{\ast}(s, a)$ is the expected cumulative reward of taking action $a$ in state $s$ and following the optimal policy thereafter.</p>
<script type="math/tex; mode=display">Q^{\pi}(s, a) = \mathbb{E}\bigg[\sum_{t \geq 0} \gamma^{t}r_{t}|s_{0}=s, a_{0}=a, \pi\bigg]</script>
<p>Remember, the value function only deals with states, and the Q function deals with state-action pairs! Now we can go about defining the optimal value and policy from the Q function values. It is clear that the optimal value and policy for a state can be defined in terms of the Q function as follows.</p>
<script type="math/tex; mode=display">V^{\ast}(s) = \max_{a}Q^{\ast}(s, a)</script>
<script type="math/tex; mode=display">\pi^{\ast}(s) = \arg \max_{a}Q^{\ast}(s, a)</script>
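<p>As a small sketch of these two definitions, suppose we already had a table of optimal Q-values. The states, actions, and numbers below are made up for illustration.</p>

```python
# Hypothetical optimal Q-values for a two-state, two-action problem.
Q_star = {
    "s0": {"left": 0.2, "right": 0.7},
    "s1": {"left": 0.5, "right": 0.1},
}

# V*(s) = max_a Q*(s, a)
V_star = {s: max(q.values()) for s, q in Q_star.items()}

# pi*(s) = argmax_a Q*(s, a)
pi_star = {s: max(q, key=q.get) for s, q in Q_star.items()}

print(V_star)
print(pi_star)
```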
<p>These optimal values can be calculated recursively using what are called the Bellman equations, defined below. Notice how the calculation of these values requires we have access to the true distributions $\mathbb{T}(s’, a, s)$ (denoted with $p(\cdot)$ below) and $R(s’, a, s)$ (denoted with $r(\cdot)$ below).</p>
<script type="math/tex; mode=display">V^{\ast}(s) = \max_{a}\sum_{s'}p(s'|s, a)[r(s, a) + \gamma V^{\ast}(s')]</script>
<script type="math/tex; mode=display">Q^{\ast}(s, a) = \sum_{s'}p(s'|s, a)[r(s, a) + \gamma V^{\ast}(s')]</script>
<p>The summation over all possible next-states $s’$ of $p(s’|s, a)$ comes from the definition of expectation in probability $\mathbb{E}[f(\cdot)] = \sum_{x}p(x) \cdot f(x)$. We are summing over all subsequent states the probability of being in that state, given the current state and action, then multiplying by the reward we get for being in that next state. It should be clear that the expected reward of being in state $s$, taking action $a$ and ending up in state $s’$ is exactly $r(s, a) + \gamma V^{\ast}(s’)$.</p>
<p>To reiterate, if we know the distributions $\mathbb{T}$ and $R$, we have a recursive way of calculating the optimal Q value of any state-action pair, and hence we can extract the optimal policy. Now we will go over two algorithms for doing so.</p>
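<p>A single evaluation of the right-hand side of the Bellman equation for $Q^{\ast}(s, a)$ can be sketched as follows. The helper name <code>q_backup</code>, the state labels, and the numbers are assumptions made for this illustration.</p>

```python
def q_backup(transitions, reward, V, gamma):
    """Sum over s' of p(s'|s, a) * (r(s, a) + gamma * V(s'))."""
    return sum(p * (reward + gamma * V[s_next]) for s_next, p in transitions.items())


# Hypothetical data for one state-action pair (s, a).
transitions = {"goal": 0.8, "slip_up": 0.1, "slip_down": 0.1}  # p(s'|s, a)
V = {"goal": 1.0, "slip_up": 0.0, "slip_down": 0.0}            # current value estimates
gamma = 0.9

q_sa = q_backup(transitions, reward=0.0, V=V, gamma=gamma)
print(q_sa)  # approximately 0.8 * gamma
```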
<h3 id="value-iteration">Value Iteration</h3>
<p>The idea of value iteration pretty much exactly follows the logic we described above. The algorithm is as follows.</p>
<center>
<div class="col-lg-10 col-md-10 col-sm-12 col-xs-12">
<img src="/fractal/assets/RL_Intro/VI-algorithm.png" />
</div>
</center>
<p>Each iteration of Value Iteration costs $O(|\mathbf{S}|^{2}|\mathbf{A}|)$ time and is very expensive for large state spaces. Recall our grid world game with values for each state initialized to 0.</p>
<center>
<div class="col-lg-6 col-md-6 col-sm-12 col-xs-12">
<img src="/fractal/assets/RL_Intro/gridworld-VI-step1.png" />
</div>
</center>
<p>Let’s do an example calculation of one iteration of Value Iteration on the state (3, 3) (where the agent is pictured).</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
V^{2}((3, 3)) &= \max_{a}\sum_{s'}p(s'|(3, 3), a)[r((3, 3), a) + \gamma V^{1}(s')] \\
&= \sum_{s'\in \{(4, 3), (3, 2)\}} p(s'|(3, 3), \text{right})[r((3, 3), \text{right}) + \gamma V^{1}(s')] \\
&= (0.8 \cdot (0 + \gamma \cdot 1)) + (0.1 \cdot (0 + \gamma \cdot 0)) + (0.1 \cdot (0 + \gamma \cdot 0)) \\
&= 0.8\gamma
\end{align*} %]]></script>
<p>Note that the above calculation did not include other actions for brevity since we already knew the max operation would give us right as the optimal action. Now state (3, 3) has value $0.8\gamma$ and we can keep recursing to calculate the values of all the other states. After doing so, this would be 1 iteration of Value Iteration. We would repeat this process until the values converge.</p>
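<p>The repeated sweep described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the exact algorithm from the figure: the two-state MDP, the dictionary layout of <code>P</code> and <code>R</code>, and the stopping tolerance are all made up for the example.</p>

```python
def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-8):
    """P[s][a] maps next states to probabilities; R[s][a] is the reward r(s, a)."""
    V = {s: 0.0 for s in states}  # initialize all values to 0
    while True:
        # One iteration: apply the Bellman backup to every state.
        V_new = {
            s: max(
                sum(p * (R[s][a] + gamma * V[s2]) for s2, p in P[s][a].items())
                for a in actions
            )
            for s in states
        }
        delta = max(abs(V_new[s] - V[s]) for s in states)
        V = V_new
        if delta < tol:  # values have (approximately) converged
            return V


# Hypothetical MDP: "s1" is absorbing and pays reward 1 on every step.
states = ["s0", "s1"]
actions = ["stay", "go"]
P = {
    "s0": {"stay": {"s0": 1.0}, "go": {"s1": 1.0}},
    "s1": {"stay": {"s1": 1.0}, "go": {"s1": 1.0}},
}
R = {
    "s0": {"stay": 0.0, "go": 0.0},
    "s1": {"stay": 1.0, "go": 1.0},
}

V = value_iteration(states, actions, P, R, gamma=0.5)
print(V)
```

<p>For this toy problem the values can be checked by hand: $V^{\ast}(s_{1}) = 1/(1 - \gamma) = 2$ and $V^{\ast}(s_{0}) = \gamma V^{\ast}(s_{1}) = 1$.</p>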
<h3 id="policy-iteration">Policy Iteration</h3>
<p>The next algorithm we will discuss is called Policy Iteration. The idea is that we start with some policy $\pi_{0}$ and iteratively refine it until the policy does not change anymore (i.e. it has converged). The algorithm involves two steps: computing the value of a policy, then using those values to greedily change the actions chosen by the previous policy to create a new policy.</p>
<center>
<div class="col-lg-10 col-md-10 col-sm-12 col-xs-12">
<img src="/fractal/assets/RL_Intro/PI-algorithm.png" />
</div>
</center>
<p>Policy Iteration has time complexity $O(|\mathbf{S}|^{3})$ for each iteration because of the linear system of equations, but in practice it often converges faster than Value Iteration because the policy becomes locked in place faster than the values in Value Iteration.</p>
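<p>The two steps can be sketched as below. This is an illustrative implementation under assumptions: the policy is evaluated exactly by solving the linear system $(I - \gamma P_{\pi})V = r_{\pi}$ with a small hand-written Gaussian elimination, and the two-state MDP is invented for the example.</p>

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def policy_iteration(states, actions, P, R, gamma=0.9):
    """P[s][a] maps next states to probabilities; R[s][a] is the reward r(s, a)."""
    idx = {s: i for i, s in enumerate(states)}
    n = len(states)
    pi = {s: actions[0] for s in states}  # arbitrary initial policy
    while True:
        # Step 1, policy evaluation: solve (I - gamma * P_pi) V = r_pi.
        A = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
        b = [R[s][pi[s]] for s in states]
        for s in states:
            for s2, p in P[s][pi[s]].items():
                A[idx[s]][idx[s2]] -= gamma * p
        v = solve_linear(A, b)
        V = {s: v[idx[s]] for s in states}

        # Step 2, policy improvement: act greedily with respect to V.
        def q(s, a):
            return sum(p * (R[s][a] + gamma * V[s2]) for s2, p in P[s][a].items())

        new_pi = {s: max(actions, key=lambda a: q(s, a)) for s in states}
        if new_pi == pi:  # policy is stable, so we are done
            return pi, V
        pi = new_pi


# Hypothetical MDP: "s1" is absorbing and pays reward 1 on every step.
states = ["s0", "s1"]
actions = ["stay", "go"]
P = {
    "s0": {"stay": {"s0": 1.0}, "go": {"s1": 1.0}},
    "s1": {"stay": {"s1": 1.0}, "go": {"s1": 1.0}},
}
R = {
    "s0": {"stay": 0.0, "go": 0.0},
    "s1": {"stay": 1.0, "go": 1.0},
}

pi, V = policy_iteration(states, actions, P, R, gamma=0.5)
print(pi, V)
```

<p>On this toy problem the policy stops changing after a single improvement step, which illustrates why Policy Iteration often needs few iterations even though each one is more expensive.</p>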
<p>Next time we will discuss how to find good policies even when the distributions $\mathbb{T}$ and $R$ are not known. This will largely amount to taking exploratory actions in the environment to collect data about what sequences of actions give good rewards and what sequences don’t. This opens up the door to the field of RL which we will soon begin exploring.</p>Reinforcement LearningSupport Vector Machines2020-01-26T08:06:43+00:002020-01-26T08:06:43+00:00http://www.michaelpiseno.com/fractal//fractal/ml/2020/01/26/SVMs-and-kernels<p>$\newcommand{\norm}[1]{\left\lVert#1\right\rVert}$</p>
<h3 id="test">Test</h3>$\newcommand{\norm}[1]{\left\lVert#1\right\rVert}$Vector Spaces, Norms, and Inner Products2020-01-23T08:06:43+00:002020-01-23T08:06:43+00:00http://www.michaelpiseno.com/fractal//fractal/mfml/2020/01/23/vector-spaces-norms-and-inner-products<p>$\newcommand{\norm}[1]{\left\lVert#1\right\rVert}$</p>
<h3 id="vector-spaces">Vector Spaces</h3>
<p>We will begin our study of the mathematical foundations of machine learning by considering the idea of a vector space. A vector space $\mathbf{S}$ is a set of elements called vectors that obey the following:</p>
<ul>
<li>For $\mathbf{x, y, z} \in \mathbf{S}$:
<ul>
<li>$\mathbf{x} + \mathbf{y} = \mathbf{y} + \mathbf{x}$ (commutative)</li>
<li>$\mathbf{x} + (\mathbf{y} + \mathbf{z}) = (\mathbf{x} + \mathbf{y}) + \mathbf{z}$ (associative)</li>
<li>$\mathbf{x} + \mathbf{0} = \mathbf{x}$ (additive identity)</li>
</ul>
</li>
<li>Scalar multiplication is distributive and associative</li>
<li>$\mathbf{S}$ is closed under scalar multiplication and vector addition. i.e.
$\mathbf{x}, \mathbf{y} \in \mathbf{S} \implies a\mathbf{x} + b\mathbf{y} \in \mathbf{S} \quad \forall a, b \in \mathbb{R}$</li>
</ul>
<p>The last bullet is arguably the most important: closure under linear combinations is what the more descriptive name “linear vector space” refers to.</p>
<p>A couple of examples of linear vector spaces are:</p>
<ol>
<li>
<p>$\mathbb{R}^{N}$</p>
<script type="math/tex; mode=display">\mathbf{x} = \begin{bmatrix}x_{1} \\ \vdots \\ x_{N}\end{bmatrix}</script>
<p>Note that the addition of any two vectors in $\mathbb{R}^{N}$ is also a vector in $\mathbb{R}^{N}$.</p>
</li>
<li>
<p>The set of all polynomials of degree at most $N$</p>
<p>Note that for polynomials $p(x) = \alpha_{N}x^{N} + \dots + \alpha_{1}x + \alpha_{0}$ and $t(x) = \beta_{N}x^{N} + \dots + \beta_{1}x + \beta_{0}$, $ap(x) + bt(x)$ is still a polynomial of degree at most $N$ for any choice of $a$ and $b$. The leading terms may cancel, which is why we say “at most”: the polynomials of degree exactly $N$ are not closed under addition. Therefore the space of all polynomials of degree at most $N$ is a linear vector space.</p>
</li>
</ol>
<p>Thinking of functions as elements of a vector space might seem strange, but we will soon see that functions can be represented as discrete sets of numbers (i.e. vectors) and manipulated the same way that we normally think about manipulating vectors in $\mathbb{R}^{N}$.</p>
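<p>As a small illustration of treating functions as vectors, a polynomial can be stored as its list of coefficients, and a linear combination of polynomials becomes a linear combination of coefficient lists. The helper names <code>combine</code> and <code>evaluate</code> are assumptions made for this sketch.</p>

```python
def combine(a, p, b, t):
    """Coefficients of a*p(x) + b*t(x); p and t are lists [alpha_0, ..., alpha_N]."""
    return [a * pc + b * tc for pc, tc in zip(p, t)]


def evaluate(coeffs, x):
    """Evaluate the polynomial with the given coefficients at the point x."""
    return sum(c * x ** k for k, c in enumerate(coeffs))


p = [1.0, 0.0, 2.0]   # p(x) = 1 + 2x^2
t = [0.0, 3.0, -2.0]  # t(x) = 3x - 2x^2

q = combine(1.0, p, 1.0, t)  # the x^2 terms cancel, leaving 1 + 3x
print(q)

# Adding coefficient vectors agrees with adding the functions pointwise.
print(evaluate(q, 2.0), evaluate(p, 2.0) + evaluate(t, 2.0))
```

<p>Note that the sum of these two degree-2 polynomials has degree 1, so a combination of degree $N$ polynomials is only guaranteed to have degree at most $N$.</p>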
<h4 id="linear-subspaces">Linear Subspaces</h4>
<p>Now that we have the notion of a vector space, we can introduce the idea of a linear subspace, which is a mathematical tool that will soon become useful. A linear subspace is a subset $\mathbf{T}$ of a vector space $\mathbf{S}$ that contains the zero vector (i.e. $\mathbf{0} \in \mathbf{T}$) and is closed under vector addition and scalar multiplication.</p>
<script type="math/tex; mode=display">\mathbf{x}, \mathbf{y} \in \mathbf{T} \implies a\mathbf{x} + b\mathbf{y} \in \mathbf{T} \quad \forall a, b \in \mathbb{R}</script>
<div class="container">
<div class="row">
<div class="col-lg-6 col-md-6 col-sm-12 col-xs-12">
<figure class="figure">
<img src="/fractal/assets/Vector_Spaces/subspace_counterexample.png" />
<figcaption class="figure-caption text-center">T is not a linear subspace</figcaption>
</figure>
</div>
<div class="col-lg-6 col-md-6 col-sm-12 col-xs-12">
<figure class="figure">
<img src="/fractal/assets/Vector_Spaces/subspace_example.png" />
<figcaption class="figure-caption text-center">T is a linear subspace</figcaption>
</figure>
</div>
</div>
</div>
<p>In the figure above we see on the left a set that is not a linear subspace. It fails because it does not contain the zero vector, and also because it is easy to see that we could take a linear combination of two vectors in $\mathbf{T}$ to get a vector outside $\mathbf{T}$, so both conditions are violated. This is not the case for the set on the right, which is in fact a linear subspace of $\mathbf{S} = \mathbb{R}^{2}$.</p>
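<p>The two subspace conditions can be checked numerically on simple sets in $\mathbb{R}^{2}$. The sets below (a line through the origin and the same line shifted up by one) are chosen for illustration and are not necessarily the sets drawn in the figure.</p>

```python
def on_line_through_origin(v):
    """Membership test for T = {(x, 2x)}, a line through the origin."""
    return abs(v[1] - 2 * v[0]) < 1e-12


def on_shifted_line(v):
    """Membership test for T' = {(x, 2x + 1)}, the same line shifted up by 1."""
    return abs(v[1] - (2 * v[0] + 1)) < 1e-12


def lin_comb(a, x, b, y):
    """Return the vector a*x + b*y for vectors in R^2."""
    return (a * x[0] + b * y[0], a * x[1] + b * y[1])


# T contains the zero vector and stays closed under this linear combination.
print(on_line_through_origin((0.0, 0.0)))
print(on_line_through_origin(lin_comb(3.0, (1.0, 2.0), -2.0, (2.0, 4.0))))

# T' misses the zero vector, and adding two of its points leaves the set.
print(on_shifted_line((0.0, 0.0)))
print(on_shifted_line(lin_comb(1.0, (0.0, 1.0), 1.0, (1.0, 3.0))))
```

<p>A finite number of checks like these can only disprove that a set is a subspace; showing that a set is a subspace requires the argument to hold for all vectors and scalars.</p>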
<h3 id="norms">Norms</h3>
<p>A vector space is a set of elements that obey certain properties. By introducing a norm to a particular vector space, we are giving it a sense of distance. A norm $\norm{\cdot}$ is a mapping from a vector space $\mathbf{S}$ to $\mathbb{R}$ such that for all $\mathbf{x, y} \in \mathbf{S}$ and all $a \in \mathbb{R}$,</p>
<ol>
<li>$\norm{\mathbf{x}} \geq 0$ and $\norm{\mathbf{x}} = 0 \iff \mathbf{x} = \mathbf{0}$</li>
<li>$\norm{\mathbf{x} + \mathbf{y}} \leq \norm{\mathbf{x}} + \norm{\mathbf{y}}$ (triangle inequality)</li>
<li>$\norm{a\mathbf{x}} = |a|\norm{\mathbf{x}}$ (homogeneity)</li>
</ol>
<p>This definition should feel familiar. The norm of a vector $\norm{\mathbf{x}}$ is its distance from the origin and the norm of the difference of two vectors $\norm{\mathbf{x - y}}$ is the distance between the two vectors. Here are some examples of norms that we will be using later on.</p>
<ol>
<li>
<p>The standard Euclidean norm (aka the $\ell_{2}$ norm): $\mathbf{S} = \mathbb{R}^{N}$</p>
<script type="math/tex; mode=display">\norm{\mathbf{x}}_{2} = \sqrt{\sum_{n=1}^{N}|x_{n}|^{2}}</script>
</li>
<li>
<p>$\mathbf{S} = $ the set of continuous functions on $\mathbb{R}$ for which the integral below is finite ($\mathbf{x}$ is a function)</p>
<script type="math/tex; mode=display">\norm{\mathbf{x}}_{2} = \sqrt{\int_{-\infty}^{\infty}|x(t)|^{2}dt}</script>
</li>
</ol>
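<p>The three norm properties can be spot-checked for the $\ell_{2}$ norm on a few vectors. This is a sketch with arbitrarily chosen vectors; passing these checks illustrates the properties but does not prove them.</p>

```python
import math


def l2_norm(x):
    """Euclidean norm: square root of the sum of squared entries."""
    return math.sqrt(sum(abs(v) ** 2 for v in x))


x = [3.0, 4.0]
y = [-1.0, 2.0]
a = -2.0

x_plus_y = [xi + yi for xi, yi in zip(x, y)]
a_times_x = [a * xi for xi in x]

# 1. Non-negativity, with norm zero only for the zero vector.
print(l2_norm(x) >= 0, l2_norm([0.0, 0.0]) == 0.0)

# 2. Triangle inequality.
print(l2_norm(x_plus_y) <= l2_norm(x) + l2_norm(y))

# 3. Homogeneity: scaling by a scales the norm by |a|.
print(math.isclose(l2_norm(a_times_x), abs(a) * l2_norm(x)))
```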
<h3 id="inner-products">Inner Products</h3>
<p>By now we have introduced vector spaces and normed vector spaces, the latter being a subset of the former. Now we will introduce the inner product. The inner product $\langle\cdot, \cdot\rangle$ is a function that takes two vectors in a vector space and produces a real number (or complex number, but we will ignore this for now).</p>
<script type="math/tex; mode=display">\langle\cdot,\cdot\rangle: \mathbf{S}\times\mathbf{S}\rightarrow \mathbb{R}</script>
<p>A valid inner product obeys three rules for $\mathbf{x, y, z}\in\mathbf{S}$:</p>
<ol>
<li>
<p>$\langle\mathbf{x},\mathbf{y}\rangle = \langle\mathbf{y},\mathbf{x}\rangle$ (symmetry)</p>
</li>
<li>
<p>For $a, b \in \mathbb{R}$</p>
<p><script type="math/tex">\langle a\mathbf{x} + b\mathbf{y}, \mathbf{z}\rangle = a\langle\mathbf{x}, \mathbf{z}\rangle + b\langle\mathbf{y}, \mathbf{z}\rangle</script> (linearity)</p>
</li>
<li>
<p>$\langle\mathbf{x}, \mathbf{x}\rangle \geq 0$ and $\langle\mathbf{x}, \mathbf{x}\rangle = 0 \iff \mathbf{x} = \mathbf{0}$</p>
</li>
</ol>
<p>Two important examples of inner products are</p>
<ol>
<li>
<p>The standard inner product (aka the dot product): $\mathbf{S} = \mathbb{R}^{N}$</p>
<script type="math/tex; mode=display">\langle\mathbf{x},\mathbf{y}\rangle = \sum_{n=1}^{N}x_{n}y_{n} = \mathbf{y}^{T}\mathbf{x}</script>
</li>
<li>
<p>The standard inner product for continuous functions on $\mathbb{R}$. If $\mathbf{x, y}$ are two such functions:</p>
<script type="math/tex; mode=display">\langle\mathbf{x}, \mathbf{y}\rangle = \int_{-\infty}^{\infty}x(t)y(t)dt</script>
</li>
</ol>
<p>The last concept I want to introduce is the idea of an induced norm. It is a fact that every valid inner product induces a valid norm (but not the other way around). This induced norm has very useful properties that not all other norms have. For some inner product $\langle\cdot,\cdot\rangle_{\mathbf{S}}$ on a vector space $\mathbf{S}$, the induced norm is defined as</p>
<script type="math/tex; mode=display">\norm{\mathbf{x}}_{\mathbf{S}} = \sqrt{\langle\mathbf{x},\mathbf{x}\rangle_{\mathbf{S}}}</script>
<p>The standard inner product induces the standard Euclidean norm. Two important properties of induced norms (not all norms!) are</p>
<ol>
<li>
<p>The Cauchy-Schwarz Inequality:</p>
<script type="math/tex; mode=display">|\langle\mathbf{x},\mathbf{y}\rangle| \leq \norm{\mathbf{x}}\norm{\mathbf{y}}</script>
</li>
<li>
<p>Pythagorean Theorem:</p>
<p>If $\langle\mathbf{x},\mathbf{y}\rangle = 0$ then $\mathbf{x}$ and $\mathbf{y}$ are orthogonal and $\norm{\mathbf{x} + \mathbf{y}}^{2} = \norm{\mathbf{x}}^{2} + \norm{\mathbf{y}}^{2}$</p>
</li>
</ol>
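<p>Both properties are easy to check numerically for the standard inner product on $\mathbb{R}^{N}$. The vectors below are arbitrary choices for illustration.</p>

```python
import math


def inner(x, y):
    """Standard inner product (dot product) on R^N."""
    return sum(a * b for a, b in zip(x, y))


def induced_norm(x):
    """Norm induced by the inner product: sqrt(<x, x>)."""
    return math.sqrt(inner(x, x))


# Cauchy-Schwarz: |<u, v>| <= ||u|| * ||v||
u = [1.0, 2.0, 3.0]
v = [4.0, -5.0, 6.0]
print(abs(inner(u, v)) <= induced_norm(u) * induced_norm(v))

# Pythagorean theorem for orthogonal vectors: the squared norms add.
x = [3.0, 0.0]
y = [0.0, 4.0]
s = [xi + yi for xi, yi in zip(x, y)]
print(inner(x, y) == 0.0)
print(math.isclose(induced_norm(s) ** 2, induced_norm(x) ** 2 + induced_norm(y) ** 2))

# The norms themselves do not add: ||x + y|| is 5 while ||x|| + ||y|| is 7.
print(induced_norm(s), induced_norm(x) + induced_norm(y))
```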
<p>A Hilbert space is an inner product space that is also complete, which means that for every infinite sequence of elements $\mathbf{x_{1}}, \mathbf{x_{2}}, \dots$ that gets closer and closer to one another, these elements also approach some precise element in the space. In more rigorous terms, it means that every Cauchy sequence is convergent, but the spaces we discuss in these guides will have this completeness property unless otherwise stated, so I will use Hilbert space and inner product space more or less interchangeably. Just keep the completeness requirement in the back of your mind.</p>
<p>All the ideas presented in these notes are important foundational mathematical concepts that we will make use of in later notes. You should become very familiar with them and know how to determine if an inner product or a norm is valid or not. Now that we have some mathematical tools, next time we will discuss a foundational problem in machine learning - linear approximation.</p>$\newcommand{\norm}[1]{\left\lVert#1\right\rVert}$Machine Learning Introduction2020-01-15T08:06:43+00:002020-01-15T08:06:43+00:00http://www.michaelpiseno.com/fractal//fractal/supp/2020/01/15/what-is-ml<h3 id="what-is-machine-learning">What is Machine Learning?</h3>
<p>Machine Learning (ML) is, as Tom Mitchell stated, “The study of algorithms that improve their performance P at some task T with experience E”. Another way of looking at it is that we are learning an algorithm that solves an inference problem or a model that describes some data set. We will discuss these two concepts at a basic level below and then introduce some cool things machine learning and deep learning have accomplished to hopefully incite some interest.</p>
<h4 id="inference">Inference</h4>
<p>Inference means making a decision or prediction about some sort of data, perhaps in a probabilistic sense. For example, I might give you an image and say “is there a cat in this image?” or “what is the probability that there is a cat in this image?”</p>
<p><img src="/fractal/assets/MLIntro/kit.jpg" alt="cat" width="200" height="150" /><img src="/fractal/assets/MLIntro/cat.jpg" alt="cat" width="200" height="150" /><img src="/fractal/assets/MLIntro/catinbox.jpg" alt="cat" width="150" height="150" /><img src="/fractal/assets/MLIntro/catbowl.jpg" alt="cat" width="150" height="150" /></p>
<p>As you might imagine, it gets more complex and ambiguous when the thing you’re trying to predict is occluded in some way or only represents the idea of the thing you’re trying to predict instead of being the thing itself (e.g. the cat-faced bowl above).</p>
<p>There are perhaps more interesting inference problems where you don’t have complete information, but you know some related information. For example, if I give you the temperature in San Francisco, San Jose, and Fremont, can you predict (infer) the temperature in Palo Alto? Or if I give you the position and velocity of a car at time $t_{1}$, can you tell me the probability the car will be 5 meters north at time $t_{2}$? The output of an inference algorithm can either be a concrete decision (e.g. the image does have a cat in it) or a probability distribution over the set of possible outcomes (e.g. there is a 60% chance that the temperature in Palo Alto is between 50 and 65 degrees Fahrenheit).</p>
<h4 id="modeling">Modeling</h4>
<p>Modeling allows us to describe data either qualitatively or numerically. There are typically two kinds of models: geometric models, which try to find geometric structure in data, and probabilistic models, which try to find a probability distribution given a bunch of samples of random variables.</p>
<p>An example of a geometric model might be: I have (square-footage, location, number of bedrooms) information for 1000 houses. How can I find some combination of these attributes that still comes close to fitting the data? This boils down to finding a lower-dimensional subspace that comes close to containing all the original data.</p>
<figure>
<center>
<img src="/fractal/assets/MLIntro/subspace.png" alt="subspace" width="300" height="200" />
<figcaption>Data close to a lower dimensional subspace</figcaption>
</center>
</figure>
<h3 id="why-study-ml">Why study ML?</h3>
<p>Besides all the buzzwords, machine learning, and indeed artificial intelligence in general, can do some pretty amazing stuff. More than 20 years ago we created AIs that could beat the world’s best chess player, and more recently DeepMind’s AlphaGo beat the world’s best Go player. These are well known examples. As we will learn soon, machine learning provides us with powerful tools to describe data. However, my personal favorite reason for studying machine learning is that it gives us a solid foundation to study more advanced topics such as those typically referred to as deep learning. This umbrella term includes everything from convolutional neural networks to generative adversarial networks to reinforcement learning agents that learn to play hide and seek. Studying ML provides us with mathematical and algorithmic frameworks for analyzing these exciting topics.</p>
<p>I’ll end with a relatively recent video by OpenAI, where agents learn how to coordinate with other agents to play hide and seek.</p>
<center>
<iframe src="https://www.youtube.com/embed/kopoLzvh5jY" width="400" height="300"></iframe>
<figcaption>Credit: OpenAI</figcaption>
</center>What is Machine Learning?