I regularly use Expected Improvement (EI) as an acquisition function in Bayesian optimization (BO), but I’ve never fully understood the derivation of its closed-form solution. This blog post derives EI step by step.

## Background

Assume we have an optimization problem $x^* = \operatorname*{arg\,min}_x f(x)$. (A maximization problem is handled by negating $f$.)

In BO with EI as an acquisition function, we want to choose the point that is expected to improve the most upon the best point observed so far, $\hat y^* = f(\hat x^*) \leq f(x_i) \; \forall i \in \{1, \dots, t\}$, where $t$ is the number of observations thus far.

More formally, we can create an improvement function $I(x)$ that tells us how much lower the posterior of our probabilistic model at $x$, $y_{\theta}(x)$, is than our best observed point $\hat y^* = f(\hat x^*)$. If the posterior is greater than the best observed point, we make $I(x)$ zero:

$$I(x) = \max\left(\hat y^* - y_{\theta}(x),\, 0\right)$$

$y_{\theta}$ is most commonly the posterior of a Gaussian process, which, at a specific input $x$, is a normal distribution:

$$y_{\theta}(x) \sim \mathcal N\left(\mu_{\theta}(x),\, \sigma_{\theta}^2(x)\right)$$

For EI, we want the expectation of the improvement:

$$EI(x) = \mathbb E\left[I(x)\right]$$

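If you’ve ever used EI or read about it in a paper, you might have been confused about how to get the following closed form of EI:

$$EI(x) = \left(\hat y^* - \mu_{\theta}(x)\right) \Phi\left(\frac{\hat y^* - \mu_{\theta}(x)}{\sigma_{\theta}(x)}\right) + \sigma_{\theta}(x)\, \phi\left(\frac{\hat y^* - \mu_{\theta}(x)}{\sigma_{\theta}(x)}\right)$$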

where $\Phi(\cdot)$ is the cumulative distribution function of the standard normal distribution and $\phi(\cdot)$ is its probability density function. Let’s derive that closed form!

## Derivation

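We can write out the Lebesgue integral for the expectation, where $p(y_{\theta} \mid x)$ is the density of the normal posterior at $x$:

$$EI(x) = \int_{-\infty}^{\infty} I(x)\, p(y_{\theta} \mid x)\, dy_{\theta}$$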

The upper limit of the integral can actually stop at $\hat y^*$ since we consider $I(x)$ to be 0 when $y_{\theta}(x)>\hat y^*$:

$$EI(x) = \int_{-\infty}^{\hat y^*} I(x)\, p(y_{\theta} \mid x)\, dy_{\theta}$$

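Substituting in the definition of $I(x)$:

$$EI(x) = \int_{-\infty}^{\hat y^*} \left(\hat y^* - y_{\theta}\right) p(y_{\theta} \mid x)\, dy_{\theta}$$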

It’s not obvious how to solve this integral since it is in terms of a parametrized random variable $y_{\theta}$. To find a closed form for the integral, we can use the reparameterization trick, which rewrites a parametrized random variable (in our case $y_{\theta}$) as a deterministic function of a non-parametrized random variable. As a side note, the reparameterization trick is most famously what makes variational autoencoders possible, but it also has applications across Bayesian optimization.

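Our reparameterization is as follows: $y_{\theta}=\mu_{\theta}(x)+\sigma_{\theta}(x)\epsilon$ where $\epsilon \sim \mathcal N(0,1)$. Therefore:

$$\epsilon(y_{\theta}) = \frac{y_{\theta}-\mu_{\theta}(x)}{\sigma_{\theta}(x)}, \qquad dy_{\theta} = \sigma_{\theta}(x)\, d\epsilon$$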

So, the limits of the integral become:

- Lower: $\epsilon(y_{\theta}) \to -\infty$ as $y_{\theta} \to -\infty$, since $\sigma_{\theta}(x) > 0$
- Upper: $Z^* = \epsilon(\hat y^*)=\frac{\hat y^*-\mu_{\theta}(x)}{\sigma_{\theta}(x)}$
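To convince yourself that the reparameterization does what we need, here is a small sketch that estimates EI by sampling $\epsilon$ and compares the result with the closed form. It assumes the convention implied by the integration limits, $I(x) = \max(\hat y^* - y_{\theta}(x), 0)$; the function names and the example values for $\mu_{\theta}(x)$, $\sigma_{\theta}(x)$, and $\hat y^*$ are made up for illustration.

```python
import random
from statistics import NormalDist


def ei_monte_carlo(mu, sigma, y_best, n_samples=200_000, seed=0):
    """Estimate E[max(y_best - y, 0)] with y = mu + sigma * eps, eps ~ N(0, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)      # non-parametrized noise
        y = mu + sigma * eps           # reparameterized posterior sample
        total += max(y_best - y, 0.0)  # improvement (assumed convention)
    return total / n_samples


def ei_closed_form(mu, sigma, y_best):
    """(y_best - mu) * Phi(Z*) + sigma * phi(Z*), with Z* = (y_best - mu) / sigma."""
    std_normal = NormalDist()  # standard normal: mean 0, std 1
    z = (y_best - mu) / sigma
    return (y_best - mu) * std_normal.cdf(z) + sigma * std_normal.pdf(z)


# Hypothetical posterior at a single input x.
mc = ei_monte_carlo(mu=0.3, sigma=0.5, y_best=0.0)
exact = ei_closed_form(mu=0.3, sigma=0.5, y_best=0.0)
print(f"Monte Carlo: {mc:.4f}, closed form: {exact:.4f}")
```

The two numbers should agree up to Monte Carlo noise, which shrinks as `n_samples` grows.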

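Now, substituting into the integral, and using that the posterior density times $dy_{\theta}$ equals $p(\epsilon)\,d\epsilon$:

$$
\begin{aligned}
EI(x) &= \int_{-\infty}^{Z^*} \left(\hat y^* - \mu_{\theta}(x) - \sigma_{\theta}(x)\epsilon\right) p(\epsilon)\, d\epsilon \\
&= \left(\hat y^* - \mu_{\theta}(x)\right) \int_{-\infty}^{Z^*} p(\epsilon)\, d\epsilon - \sigma_{\theta}(x) \int_{-\infty}^{Z^*} \epsilon\, p(\epsilon)\, d\epsilon \\
&= \left(\hat y^* - \mu_{\theta}(x)\right) \Phi(Z^*) - \sigma_{\theta}(x) \int_{-\infty}^{Z^*} \epsilon\, p(\epsilon)\, d\epsilon
\end{aligned}
$$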

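The last step just uses the definition of the CDF, recalling that $\epsilon$ is normally distributed. If you replace $Z^*$, you’ll see that the first term is already the same as our desired form. For the second term, we need to substitute in the equation for the PDF of the normal distribution and simplify. We then get:

$$-\sigma_{\theta}(x) \int_{-\infty}^{Z^*} \epsilon\, p(\epsilon)\, d\epsilon = -\frac{\sigma_{\theta}(x)}{\sqrt{2\pi}} \int_{-\infty}^{Z^*} \epsilon\, e^{-\epsilon^2/2}\, d\epsilon$$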

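We can use the substitution $u=-\epsilon^2/2$ ($du=-\epsilon\, d\epsilon$), so:

$$-\frac{\sigma_{\theta}(x)}{\sqrt{2\pi}} \int_{-\infty}^{Z^*} \epsilon\, e^{-\epsilon^2/2}\, d\epsilon = \frac{\sigma_{\theta}(x)}{\sqrt{2\pi}} \int_{-\infty}^{-(Z^*)^2/2} e^{u}\, du = \frac{\sigma_{\theta}(x)}{\sqrt{2\pi}}\, e^{-(Z^*)^2/2} = \sigma_{\theta}(x)\, p(Z^*)$$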
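If the substitution feels like sleight of hand, the result is easy to check numerically. The sketch below approximates $\int_{-\infty}^{Z^*} \epsilon\, p(\epsilon)\, d\epsilon$ with a plain trapezoid rule and compares it with $-p(Z^*)$, which is what the substitution gives before multiplying by $-\sigma_{\theta}(x)$. The helper names, the truncation of the lower limit at $-10$, and the test values of $Z^*$ are my own choices for illustration.

```python
import math


def std_normal_pdf(e):
    """Density of the standard normal distribution at e."""
    return math.exp(-0.5 * e * e) / math.sqrt(2.0 * math.pi)


def second_term_integral(z, lower=-10.0, n=100_000):
    """Trapezoid-rule estimate of the integral of e * pdf(e) from `lower` to z.

    `lower` stands in for -infinity; the integrand is negligible below -10.
    """
    h = (z - lower) / n
    total = 0.5 * (lower * std_normal_pdf(lower) + z * std_normal_pdf(z))
    for k in range(1, n):
        e = lower + k * h
        total += e * std_normal_pdf(e)
    return total * h


for z in (-1.5, 0.0, 0.8):
    numeric = second_term_integral(z)
    analytic = -std_normal_pdf(z)
    print(f"Z*={z:+.1f}  numeric={numeric:.6f}  -pdf(Z*)={analytic:.6f}")
```

The two columns should match to several decimal places for any $Z^*$, positive or negative.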

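Just note that $p(\epsilon)=\phi(\epsilon)$. Therefore, putting this all together, we get the desired final result:

$$EI(x) = \left(\hat y^* - \mu_{\theta}(x)\right) \Phi\left(\frac{\hat y^* - \mu_{\theta}(x)}{\sigma_{\theta}(x)}\right) + \sigma_{\theta}(x)\, \phi\left(\frac{\hat y^* - \mu_{\theta}(x)}{\sigma_{\theta}(x)}\right)$$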

As Martin Krasser points out, the two terms we get in this closed form represent the classic exploration-exploitation tradeoff. The first term is the exploitation term, which primarily takes into account the improvement of the mean over our best observed value (though $Z^*$ is scaled by $1/\sigma_{\theta}(x)$, which reduces this exploitation under high uncertainty). The second term focuses on exploration by allocating more weight to predictions with large uncertainty.
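To see the tradeoff in numbers, the sketch below splits the closed form into its two terms and evaluates them for a fixed predicted mean at increasing uncertainty. It assumes $Z^* = (\hat y^* - \mu_{\theta}(x))/\sigma_{\theta}(x)$ as defined above; `ei_terms` and the example values are hypothetical.

```python
from statistics import NormalDist


def ei_terms(mu, sigma, y_best):
    """Return (exploitation term, exploration term) of the closed-form EI."""
    std_normal = NormalDist()  # standard normal: mean 0, std 1
    z = (y_best - mu) / sigma
    exploit = (y_best - mu) * std_normal.cdf(z)  # mean improvement, weighted by Phi(Z*)
    explore = sigma * std_normal.pdf(z)          # grows with predictive uncertainty
    return exploit, explore


# Same predicted mean (slightly better than the best observation), rising sigma.
for sigma in (0.1, 0.5, 2.0):
    exploit, explore = ei_terms(mu=-0.2, sigma=sigma, y_best=0.0)
    print(f"sigma={sigma:.1f}  exploit={exploit:.4f}  explore={explore:.4f}")
```

For this example the exploitation term shrinks and the exploration term grows as `sigma` increases, which is the behavior described above.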

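Also, note that, in practice, we would implement EI with an if statement that handles $\sigma_{\theta}(x) = 0$, where $Z^*$ is undefined. In that case EI reduces to $\max(\hat y^* - \mu_{\theta}(x), 0)$, which is zero if there is no predicted improvement.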
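A minimal sketch of such a guarded implementation, for a single input $x$, could look like the following. It assumes the convention used in the derivation, $Z^* = (\hat y^* - \mu_{\theta}(x))/\sigma_{\theta}(x)$, and the function name `expected_improvement` is hypothetical rather than taken from any library. The guard covers zero predictive uncertainty, where the closed form would divide by zero.

```python
from statistics import NormalDist


def expected_improvement(mu, sigma, y_best):
    """Closed-form EI at one input, given posterior mean `mu` and std `sigma`."""
    if sigma <= 0.0:
        # No predictive uncertainty: the improvement is deterministic,
        # and it is zero when the mean does not beat the best observation.
        return max(y_best - mu, 0.0)
    std_normal = NormalDist()  # standard normal: mean 0, std 1
    z = (y_best - mu) / sigma
    return (y_best - mu) * std_normal.cdf(z) + sigma * std_normal.pdf(z)


print(expected_improvement(mu=1.0, sigma=0.0, y_best=0.0))  # guarded branch
print(expected_improvement(mu=1.0, sigma=0.5, y_best=0.0))  # closed-form branch
```

A vectorized version over many candidate points would apply the same guard elementwise.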

Thanks to the following resources: