Derivation of the Expected Improvement Acquisition Function

Derivation of the Expected Improvement Acquisition Function

I regularly use the Expected Improvement (EI) as an acquisition function in Bayesian optimization (BO), but I’ve never fully understood the derivation of its closed form solution. This blog post aims to derive EI step by step.

Background

Assume we have an optimization problem x=arg maxxf(x)x^* = \argmax_x f(x)

In BO with EI as an acquisition function, we want to choose the point that is expected to improve the most upon the existing best observed point y^=f(x^)f(xi)i(1,,t)\hat y^* = f(\hat x^*) \geq f(x_i) \forall i \in (1, \dots, t) . tt is the number of observations thus far.

More formally, we can create an improvement function I(x)I(x) that tells us how much better the posterior of our probabilistic model at xx is than our best observed point y^=f(x^)\hat y^* = f(\hat x^*). If it the posterior is less than the best observed point, we make I(x)I(x) zero.

I(x)=max(yθ(x)f(x^),0)I(x) = \max(y_{\theta}(x) -f(\hat x^*), 0)

yθy_{\theta} is most commonly the posterior of a Gaussian process, which, at a specific input xx, is a normal distribution:

yθ(x)N(μθ(x),σθ2(x))y_{\theta}(x) \sim \mathcal{N}(\mu_{\theta}(x), \sigma^2_{\theta}(x))

For EI, we want the expectation of the improvement:

EI(x)=Ey[I(x)]EI(x) = \mathbb E_{y}[I(x)]

If you’ve ever used EI or read about it in a paper, you might have been confused how to get the following closed form of EI:

EI(x)=(μθ(x)y^)Φ(y^μθ(x)σθ(x))+σθ(x)ϕ(y^μθ(x)σθ(x))EI(x) =(\mu_{\theta}(x)-\hat y^*)\Phi \biggl(\frac{\hat y^*-\mu_{\theta}(x)}{\sigma_{\theta}(x)}\biggr) + \sigma_{\theta}(x) \phi \biggl(\frac{\hat y^*-\mu_{\theta}(x)}{\sigma_{\theta}(x)}\biggr)

where Φ()\Phi(\cdot) is the cumulative distribution function of the normal distribution and ϕ()\phi(\cdot) is the probability density function of the normal distribution. Let’s derive that closed form!

Derivation

EI(x)=I(x)p(yθx)dyEI(x) = \int_{-\infty}^{\infty}I(x)p(y_{\theta}\vert x)dy

The upper limit of the expectation integral can actually stop at y^\hat y^* since we consider I(x)I(x) to be 0 when yθ(x)>y^y_{\theta}(x)>\hat y^*:

EI(x)=y^I(x)p(yθx)dyEI(x) = \int_{-\infty}^{\hat y^*}I(x)p(y_{\theta} \vert x)dy

Substituting in the definition I(x)I(x):

y^(yθ(x)y^)p(yθx)dy\int_{-\infty}^{\hat y^*}(y_{\theta}(x)-\hat y^*)p(y_{\theta} \vert x)dy

It’s not obvious how to solve this integral since it is in terms of a parametrized random variable yθy_{\theta}. To find a closed form for the integral, we can use the reparameterization trick, which essentially replaces a parametrized random variable (in our case yθy_{\theta}) with a deterministic function multiplied by a non-parametrized random variable. As a side note, the reparametrization trick is most famously what makes variational autoencoders possible, but it also has applications across Bayesian optimization.

Our reparameterization is as follows: yθ=μθ(x)+σθ(x)ϵy_{\theta}=\mu_{\theta}(x)+\sigma_{\theta}(x)\epsilon where ϵN(0,1)\epsilon \sim \mathcal N(0,1). Therefore:

ϵ=yμθ(x)σθ(x)\epsilon = \frac{y-\mu_{\theta}(x)}{\sigma_{\theta}(x)}

So, the limits of the integral become:

  • Lower: ϵ()=μθ(x)σθ(x)\epsilon(-\infty) = \frac{-\infty-\mu_{\theta}(x)}{\sigma_{\theta}(x)} \approx -\infty
  • Upper: Z=ϵ(y^)=y^μθ(x)σθ(x)Z^* = \epsilon(\hat y^*)=\frac{\hat y^*-\mu_{\theta}(x)}{\sigma_{\theta}(x)}

Now, substituting into the integral:

EI(x)=Z(μ(x)+σ(x)ϵy^)p(ϵ)dϵEI(x) = \int_{-\infty}^{Z^*}(\mu(x) + \sigma(x)\epsilon-\hat y^*)p(\epsilon)d\epsilon
EI(x)=(μ(x)y^)Zp(ϵ)dϵ σ(x)Zϵp(ϵ)dϵEI(x) =(\mu(x)-\hat y^*)\int_{-\infty}^{Z^*} p(\epsilon)d\epsilon -  \sigma(x) \int_{-\infty}^{Z^*} \epsilon p(\epsilon) d\epsilon
EI(x)=(μ(x)y^)Φ(Z) σ(x)Zϵp(ϵ)dϵEI(x) =(\mu(x)-\hat y^*)\Phi(Z^*) -  \sigma(x) \int_{-\infty}^{Z^*} \epsilon p(\epsilon) d\epsilon

That above step just uses the definition of the CDF, recalling that ϵ\epsilon is normally distributed. If you replace ZZ^*, you’ll see that the first term is already the same as our desired form. For the second term, we need to substitute in the equation for the PDF of the normal distribution and simplify. We then get:

p(ϵ)=ϕ(ϵ)=12πexp(12(ϵ01)2)=12πeϵ2/2p(\epsilon) = \phi(\epsilon) = \frac{1}{ \sqrt{2\pi}} \exp\biggl(-\frac{1}{2} \biggl( \frac{\epsilon-0}{1}\biggr)^2\biggr) = \frac{1}{\sqrt{2\pi}}e^{-\epsilon^2/2}
Zϵp(ϵ)dϵ=12πZϵeϵ2/2dϵ=12πeϵ2/2Z=ϕ(Z)ϕ()=ϕ(Z)\int_{-\infty}^{Z^*} \epsilon p(\epsilon) d\epsilon= \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{Z^*} \epsilon e^{-\epsilon^2/2}d\epsilon = \frac{-1}{\sqrt{2\pi}} e^{-\epsilon^2/2}\vert^{Z^*}_{-\infty} = \phi(Z^*)-\phi(-\infty)=\phi(Z^*)
Proof of the final integral
abxex2/2dx\int_{a}^{b} x e^{-x^2/2}dx

We can use the substitution u=x2/2u=-x^2/2 (du=xdxdu=-xdx), so:

abeudu=euab=ex2/2ab\int_{a'}^{b'} -e^udu = -e^u \vert_{a'}^{b'} =-e^{-x^2/2}\vert_{a}^{b}

Just note that this p(ϵ)=ϕ(ϵ)p(\epsilon)=\phi(\epsilon). Therefore, putting this all together, we get desired final result:

EI(x)=(μθ(x)y^)Φ(Z)Exploitation+σθ(x)ϕ(Z)ExplorationEI(x) =\underbrace{(\mu_{\theta}(x)-\hat y^*)\Phi(Z^*)}_{\text{Exploitation}} + \underbrace{\sigma_{\theta}(x) \phi(Z^*)}_{\text{Exploration}}

As Martin Krasser points out, the two terms we get in this closed form represent the classic exploration-exploitation tradeoff. The first term is the exploitation term that that primarily takes into account the improvement of the mean over our best observed value (though ZZ^* is weighted by σθ(x)\sigma_{\theta}(x) to reduce this exploitation under high uncertainty). The second term focuses on exploration by allocating more weight to predictions with large uncertainty.

Also, note that, in practice, we would implement EI using an if statement to set EI(x)EI(x) to zero if there is no predicted improvement.

Thanks to following resources: