Jason Blevins’s Notebook
Chib and Greenberg (1995)

Citation

Chib, Siddhartha and Edward Greenberg (1995): “Understanding the Metropolis Hastings Algorithm,” American Statistical Journal, 49, 327–335.

@Article{chib95understanding,
  author       = {Siddhartha Chib and Edward Greenberg},
  title        = {Understanding the {Metropolis-Hastings} Algorithm},
  journal      = {American Statistical Association},
  year         = 1995,
  volume       = 49,
  pages        = {327--335}
}

Introduction

The Metropolis-Hastings algorithm is a Markov chain Monte Carlo method developed by Metropolis, Rosenbluth, Rosenbluth, Teller, and Teller (1953) and generalized by Hastings (1970). This algorithm is very general and gives rise to the Gibbs sampler as a special case. This article provides a self-contained intuitive development of the algorithm from first principles. The paper focuses on absolutely continuous target densities, but the algorithm also applies to discrete and mixed distributions.

Acceptance-Rejection Sampling

Classical sampling methods typically generate independent samples. One such method is the A-R method. The objective is to sample from an absolutely continuous target density π(x)=f(x)/K, where x d, f(x) is the unnormalized target density, and K the potentially unknown normalizing constant. Suppose that we can sample from another density h(x) and that there exists a constant c such that f(x)ch(x) for all x. To obtain a draw from π:

  1. Draw a candidate Z from h and u from U(0,1 ).
  2. If uf(Z)ch(Z), return Z.
  3. Otherwise, return to 1.

The expected number of iterations required to accept a draw is c 1 . To ensure efficiency, the optimal choice of c is

c=sup xf(x)h(x).
c = \sup_x \frac{f(x)}{h(x)}.

Markov Chain Monte Carlo Simulation

Consider a Markov transition kernel P(x,A) for x d and A, where is the Borel σ-algebra on d. Two major concerns in Markov chain theory are whether there exists an invariant distribution and whether iterating the transition kernel converges to the invariant distribution. The invariant distribution π * satisfies

π *(dy)= dP(x,dy)π(x)dx,
\pi^*(dy) = \int_{\mathbb{R}^d} P(x, dy) \pi(x)\,dx,

where π is the density of π * with respect to Lebesgue measure. Let

P (n)(x,A)= dP (n1 )(x,dy)P(y,A)
P^{(n)}(x,A) = \int_{\mathbb{R}^d} P^{(n-1)}(x,dy)P(y,A)

denote the n-th iteration of the transition kernel. Conditions are described under which P (n)(x,A) converges to π * as n. That is, the distribution of the draws generated by the iterations is approximately π *.

MCMC methods look at the problem from the opposite perspective. The invariant distribution is known (the target distribution). The problem is how to generate an appropriate transition kernel with the aforementioned convergence property. Suppose that the transition kernel can be expressed as

P(x,dy)=p(x,y)dy+r(x)δ x(dy)
P(x, dy) = p(x,y)\,dy + r(x) \delta_x(dy)

where p(x,x)=0 , δ x(dy)=1 if xdy and 0 otherwise, and define r(x)=1 dp(x,y)dy. The r(x) term represents the probability of staying at x which may be nonzero. If this kernel satisfies the reversibility condition

(1)π(x)p(x,y)=π(y)p(y,x),
\pi(x)p(x,y) = \pi(y)p(y,x),

then π is the invariant distribution of P(x,). The Metropolis-Hastings algorithm generates a p(x,y) with this property.

The Metropolis-Hastings Algorithm

Let q(x,y) denote the candidate-generating density, where q(x,y)dy=1 . This is essentially the conditional density of y given x. This density could potentially be used as the p(x,y) term in the transition kernel, but it may not satisfy (1). If, for example, we have

(2)π(x)q(x,y)>π(y)q(y,x),
\pi(x)q(x,y) \gt \pi(y)q(y,x),

then we can adjust q by using a probability of move α(x,y). Transitions will be made using

p MH(x,y)q(x,y)α(x,y)
p_{MH}(x,y) \equiv q(x,y)\alpha(x,y)

for xy.

The choice of α follows the following logic. If (2) holds, then moves from x to y are happening too often under q. We should thus choose α(y,x)=1 . But then, in order to satisfy (1), we must have

π(x)q(x,y)α(x,y) =π(y)q(y,x)α(y,x) =π(y)q(y,x).
\begin{split} \pi(x)q(x,y)\alpha(x,y) &= \pi(y)q(y,x)\alpha(y,x) \\ &= \pi(y)q(y,x). \end{split}

This implies that

α(x,y)=π(y)q(y,x)π(x)q(x,y).
\alpha(x,y) = \frac{\pi(y)q(y,x)}{\pi(x)q(x,y)}.

Conversely, we can consider the case when the inequality in (1) were reversed to derive α(y,x). Thus, to summarize, in order to satisfy reversibility, we set

(3)α(x,y)={min[π(y)q(y,x)π(x)q(x,y),1 ], ifπ(x)q(x,y)>0 1 , otherwise.
\alpha(x,y) = \begin{cases} \min\left[ \frac{\pi(y)q(y,x)}{\pi(x)q(x,y)}, 1 \right], &\text{if}\; \pi(x)q(x,y) \gt 0 \\ 1, &\text{otherwise.} \end{cases}

Hence, the desired transition kernel is

(4)P MH(x,dy)=q(x,y)α(x,y)dy+[1 dq(x,y)α(x,y)dy]δ x(dy).
P_{MH}(x,dy) = q(x,y)\alpha(x,y)\,dy + \Bigl[ 1 - \int_{\mathbb{R}^d} q(x,y)\alpha(x,y)\,dy \Bigr] \delta_x(dy).

Thus, the Metropolis-Hastings algorithm is defined by the candidate-generating density q(x,y). Note that α(x,y) does not require knowledge of the normalizing constant because it drops out of the ratio π(y)/π(x). A special case arises when the candidate-generating density is symmetric: q(x,y)=q(y,x) since the probability of move reduces to π(y)/π(x). This special case forms the basis for optimization algorithms such as Simulated Annealing.

The algorithm proceeds as follows: Given an initial value x (0 ), for each j=1 ,2 ,,N

  1. Draw y from q(x (j),) and u from U(0,1 )
  2. If uα(x (j),y), set x (j+1 )=y.
  3. Otherwise, set x (j+1 )=x (j).