## What Is the Policy Gradient?

$\theta_{k+1} = \theta_{k} + \alpha \nabla_{\theta} J(\pi_{\theta})\rvert_{\theta=\theta_{k}}$

$$\nabla_{\theta} J(\pi_{\theta})$$ is called the *policy gradient*. Many algorithms build on it, such as TRPO, PPO, and DDPG. We will now derive the policy gradient.
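The gradient-ascent update above can be sketched in a few lines. This is a minimal illustration, not an implementation: `grad_J` stands in for a policy-gradient estimate (hypothetical values), and `alpha` is the step size.

```python
import numpy as np

# Minimal sketch of the gradient-ascent update; grad_J stands in for a
# policy-gradient estimate (hypothetical values), alpha is the step size.
def gradient_ascent_step(theta, grad_J, alpha):
    # theta_{k+1} = theta_k + alpha * grad_J
    return theta + alpha * grad_J

theta = np.zeros(3)
theta_next = gradient_ascent_step(theta, grad_J=np.array([1.0, -2.0, 0.5]), alpha=0.1)
```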

### Deriving the Policy Gradient

1. The probability of a trajectory $$\tau$$:

\begin{align*} P(\tau|\theta) &= P(s_0, a_0, s_1, a_1, \ldots, s_T, a_T) \\ &= P(s_0)\prod_{t=0}^{T}P(s_{t+1}|s_t, a_t)\pi_{\theta}(a_t|s_t) \end{align*}
2. The log-derivative trick: for any differentiable function $$f$$ depending on $$\theta$$, we have

$\nabla_{\theta}f(x) = f(x)\nabla_{\theta}\log f(x)$
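The trajectory factorization in step 1 can be sketched on a toy MDP. Everything here is an assumed example (2 states, 2 actions, made-up probabilities), just to show how $$P(\tau|\theta)$$ decomposes into initial-state distribution, dynamics, and policy terms.

```python
import numpy as np

# Assumed toy MDP: 2 states, 2 actions, made-up probabilities.
P0 = np.array([0.6, 0.4])                      # P(s_0)
P_dyn = np.array([[[0.9, 0.1], [0.2, 0.8]],    # P(s' | s, a), indexed [s][a][s']
                  [[0.5, 0.5], [0.3, 0.7]]])
pi = np.array([[0.7, 0.3], [0.4, 0.6]])        # pi(a | s), indexed [s][a]

def trajectory_prob(states, actions):
    """P(tau) = P(s_0) * prod_t pi(a_t|s_t) * P(s_{t+1}|s_t,a_t)."""
    p = P0[states[0]]
    for t in range(len(actions)):
        p *= pi[states[t], actions[t]] * P_dyn[states[t], actions[t], states[t + 1]]
    return p

# Example trajectory: s_0=0, a_0=0, s_1=1, a_1=1, s_2=1
p = trajectory_prob(states=[0, 1, 1], actions=[0, 1])
```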

\begin{align*} \nabla_{\theta}J(\pi_{\theta}) &= \nabla_{\theta} \mathbb{E}_{\tau \sim \pi_{\theta}}[R(\tau)] \\ &= \nabla_{\theta} \int_{\tau} P(\tau | \theta) R(\tau) \\ &= \int_{\tau} \nabla_{\theta} P(\tau | \theta) R(\tau) \\ &= \int_{\tau} P(\tau | \theta) \nabla_{\theta}\log P(\tau | \theta) R(\tau) \quad \text{log-derivative trick} \\ &= \mathbb{E}_{\tau \sim \pi_{\theta}}\big[\nabla_{\theta}\log P(\tau | \theta) R(\tau)\big] \quad \text{definition of expectation} \\ &= \mathbb{E}_{\tau \sim \pi_{\theta}}\bigg[\sum_{t=0}^{T}\nabla_{\theta}\log \pi_{\theta}(a_t|s_t)R(\tau)\bigg] \quad \text{grad-log-prob expansion (below)} \end{align*}

The last step uses the expansion of $$\nabla_{\theta}\log P(\tau | \theta)$$: neither the initial-state distribution nor the environment dynamics depend on $$\theta$$, so their gradients vanish:

\begin{align*} \nabla_{\theta}\log P(\tau | \theta) &= \nabla_{\theta}\log P(s_0) + \sum_{t=0}^{T}\bigg(\nabla_{\theta}\log P(s_{t+1} | s_t, a_t) + \nabla_{\theta}\log \pi_{\theta}(a_t | s_t)\bigg) \\ &= 0 + \sum_{t=0}^{T}\bigg(0 + \nabla_{\theta}\log \pi_{\theta}(a_t | s_t)\bigg) \\ &= \sum_{t=0}^{T}\nabla_{\theta}\log \pi_{\theta}(a_t | s_t) \end{align*}
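The log-derivative trick at the heart of this derivation can be checked numerically. The toy function below is an assumed example chosen so that $$\log f$$ has a simple closed-form gradient; the trick's output should match a finite-difference estimate.

```python
import numpy as np

# Assumed toy function f(theta) = exp(theta) * theta^2, so
# log f = theta + 2*log(theta) and grad log f = 1 + 2/theta.
def f(theta):
    return np.exp(theta) * theta**2

def grad_log_f(theta):
    return 1.0 + 2.0 / theta

theta = 1.5
eps = 1e-6

# Central finite difference vs. the log-derivative trick:
# grad f = f * grad log f.
numeric_grad = (f(theta + eps) - f(theta - eps)) / (2 * eps)
trick_grad = f(theta) * grad_log_f(theta)
```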

### Estimating the Policy Gradient

Given a set of trajectories $$\mathcal{D}$$ collected by running $$\pi_{\theta}$$ in the environment, the policy gradient can be estimated by a sample mean:

$$\hat{g} = \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}}\sum_{t=0}^{T} \nabla_{\theta}\log \pi_{\theta}(a_t | s_t)R(\tau)$$

$\theta_{k+1} = \theta_{k} + \alpha \hat{g}$
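The estimator and update can be sketched end to end on an assumed two-armed bandit (trajectories of length one, so $$R(\tau)$$ is just the reward of the chosen arm). The softmax policy, reward values, batch size, and step size are all illustrative choices, not part of the derivation above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed two-armed bandit: softmax policy over 2 actions, arm 0 pays 1, arm 1 pays 0.
theta = np.zeros(2)
rewards = np.array([1.0, 0.0])

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def grad_log_pi(theta, a):
    # For a softmax policy: grad_theta log pi(a) = one_hot(a) - pi
    g = -softmax(theta)
    g[a] += 1.0
    return g

# Collect a batch D of trajectories and average the gradient samples.
D = [rng.choice(2, p=softmax(theta)) for _ in range(5000)]
g_hat = np.mean([grad_log_pi(theta, a) * rewards[a] for a in D], axis=0)

# One gradient-ascent step: theta_{k+1} = theta_k + alpha * g_hat
theta = theta + 0.5 * g_hat
```

After the update, the parameter for the rewarding arm should have grown relative to the other, shifting probability mass toward it.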

## Baselines

For any function $$b$$ that depends only on the state (a *baseline*), the expected grad-log-prob weighted by the baseline is zero:

$$\mathbb{E}_{a_t,s_t \sim \pi_{\theta}}[\nabla_{\theta}\log \pi_{\theta}(a_t|s_t)b(s_t)] = 0$$
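This identity can be verified exactly for a small softmax policy. The parameters and baseline value below are arbitrary assumed numbers; the expectation over actions is computed in closed form rather than by sampling.

```python
import numpy as np

# Assumed softmax policy over 3 actions with arbitrary parameters.
theta = np.array([0.3, -1.2, 0.7])

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

pi = softmax(theta)
b = 4.2  # arbitrary baseline value b(s)

# For a softmax policy, grad_theta log pi(a) = one_hot(a) - pi, so the exact
# expectation over actions is sum_a pi(a) * (one_hot(a) - pi) * b.
expectation = sum(pi[a] * (np.eye(3)[a] - pi) * b for a in range(3))
```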

A common choice is the on-policy value function, $$b(s_t) = V^{\pi_{\theta}}(s_t)$$, which makes the weight on each grad-log-prob term an advantage:

$A^{\pi_{\theta}}(s_t, a_t) = Q^{\pi_{\theta}}(s_t, a_t) - V^{\pi_{\theta}}(s_t)$

Because the baseline term has zero expectation, it can be subtracted from $$R(\tau)$$ without biasing the policy gradient:

$$\nabla_{\theta} J(\pi_{\theta}) = \mathbb{E}_{\tau \sim \pi_{\theta}}\bigg[\sum_{t=0}^{T} \nabla_{\theta}\log \pi_{\theta}(a_t | s_t)\big(R(\tau) - b(s_t)\big)\bigg]$$

A well-chosen baseline reduces the variance of the estimator while leaving it unbiased.
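The variance-reduction effect can be checked on an assumed two-armed softmax bandit where both arms pay a large, similar reward; all constants below are illustrative. With and without the baseline, the sample-mean gradient agrees, but the baseline shrinks the spread of the individual gradient samples.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed two-armed softmax bandit; both arms pay large, similar rewards,
# which is exactly the regime where a baseline helps.
theta = np.array([0.0, 0.0])
rewards = np.array([10.0, 9.0])

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

pi = softmax(theta)
baseline = pi @ rewards  # b(s) = expected reward under pi, i.e. V(s)

def grad_sample(a, b):
    # (one_hot(a) - pi) is grad_theta log pi(a) for a softmax policy
    return (np.eye(2)[a] - pi) * (rewards[a] - b)

actions = rng.choice(2, p=pi, size=20000)
no_baseline = np.array([grad_sample(a, 0.0) for a in actions])
with_baseline = np.array([grad_sample(a, baseline) for a in actions])
```

With these deterministic rewards and $$b = V(s)$$, the baseline happens to remove the sampling variance entirely; in general it only reduces it.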