Jason Blevins’s Notebook
Doucet de Freitas and Gordon (2001)
Citation
These notes are based on the following article:
Doucet, Arnaud, Nando de Frietas, and Neil Gordon (2001): “An Introduction to Sequential Monte Carlo Methods,” in Sequential Monte Carlo Methods in Practice, ed. by A. Doucet, N. de Freitas, and N. Gordon. New York: Springer-Verlag.
Introduction
Real-world data analysis often requires estimating unknown quantities given only a sequence of observations on some related observable quantity. In a Bayesian framework, one typically has a priori knowledge of the model: a prior distribution of the unobservable quantity of interest and likelihood functions which relate the observables to the unobservables. The resulting posterior distribution of the unobservables can be calculated using Bayes’ theorem. This allows one to conduct inference about the unobserved quantities.
In some cases, it is natural to process observations sequentially. These cases are the focus of this introductory article and the rest of the book. For example, on-line applications such as radar tracking or estimating financial instruments where new data become available in real time, it is easier to update the previously formed posterior distribution than to recalculate it from scratch.
Sequential Monte Carlo methods are simulation-based methods for calculating approximations to posterior distributions. They avoid making linearity or normality assumptions required by related methods such as the Kalman filter.
Model
Let denote the sequence of unobserved states, with . The authors consider only the case when is a Markov process with initial distribution . The Markov assumption is not needed and is only for tractability. Let denote the transition equation. Let denote a sequence of observations with . Suppose that, given , the observations are conditionally independent. Let denote the marginal distribution, or the observation equation.
Thus, the model is defined by the following distributions:
\begin{gathered}
p(x_0), \\
p(x_t \vert x_{t-1}) \quad \text{for} \quad t \geq 1, \\
p(y_t \vert x_t) \quad \text{for} \quad t \geq 1.
\end{gathered}
Define and . Our goal is to estimate the posterior distribution recursively. We also may care about the marginal distribution or an expectation such as
I(h_t) \equiv E[h_t(x_{0:t}) \vert y_{1:t}]
= \int h_t(x_{0:t}) p(x_{0:t} \vert y_{1:t})\,dx_{0:t}
for some that is integrable with respect to .
Bayes’ theorem gives us the posterior distribution at any time :
p(x_{0:t} \vert y_{1:t}) = \frac{p(y_{1:t} \vert x_{0:t})p(x_{0:t})}{\int p(y_{1:t} \vert x_{0:t})p(x_{0:t})\,dx_{0:t}}
There is an inherent recursive relationship for updating the posterior distribution. We can express in terms of :
\begin{aligned}
p(x_{0:t+1} \vert y_{1:t+1}) &=
\frac{p(x_{0:t+1}, y_{1:t+1})}{p(y_{1:t+1})} \\
&= \frac{p(x_{t+1}, y_{t+1} \vert x_{0:t}, y_{0:t})p(x_{0:t}, y_{1:t})}{p(y_{t+1} \vert y_{1:t})p(y_{1:t})} \\
&= \frac{p(y_{t+1} \vert x_{t+1})p(x_{t+1} \vert x_t}
{p(y_{t+1} \vert y_{1:t})} p(x_{0:t} \vert y_{1:t}) \\
\end{aligned}
where the last equality uses the definition of the conditional density, the conditional independence assumption, and the Markov assumption. The expression in the denominator is constant with respect to .
Perfect Monte Carlo Sampling
In the ideal situation, we could draw a sample of size from , the posterior distribution of interest. Call this sample . We can form an estimate of using the empirical distribution of the sample:
P_N(dx_{0:t} \vert y_{0:t}) = N^{-1} \sum_{i=1}^N \delta_{x_{0:t}^{(i)}}(dx_{0:t}).
From here, we can estimate the integral by
\int h_t(x_{0:t}) P_N(dx_{0:t} \vert y_{1:t}) = N^{-1} \sum_{i=1}^N h_t(x_{0:t}^{(i)}).
Typically it is impossible to get such a sample since is multivariate, known only up to a constant of proportionality, and non-standard. Markov chain Monte Carlo methods may be used in similar situations, but they are not well-suited to recursive problems such as the ones considered here.
Importance Sampling
Since we can’t draw from directly, we can use importance sampling to draw from indirectly using an importance sampling density and a corresponding importance weight for each draw. The importance sampling density must be chosen so that its support includes the support of . We have
\begin{aligned}
I(h_t) \equiv E[h_t(x_{0:t}) \vert y_{1:t}]
&= \int h_t(x_{0:t}) p(x_{0:t} \vert y_{1:t})\,dx_{0:t} \\
&= \frac{\int h_t(x_{0:t}) p(x_{0:t} \vert y_{1:t})\,dx_{0:t}}
{\int p(x_{0:t} \vert y_{1:t})\,dx_{0:t}} \\
&= \frac{\int h_t(x_{0:t}) w(x_{0:t}) \pi(x_{0:t} \vert y_{1:t})\,dx_{0:t}}
{\int w(x_{0:t}) \pi(x_{0:t} \vert y_{1:t})\,dx_{0:t}}
\end{aligned}
where
w(x_{0:t}) \equiv \frac{p(x_{0:t} \vert y_{1:t})}{\pi(x_{0:t} \vert y_{1:t})}
is the importance weight.
Thus, to approximate the expectation , we simply need to draw a sample from and calculate
I(h_t) = \frac{N^{-1} \sum_i h_t(x_{0:t}) w(x_{0:t}^{(i)})}
{N^{-1} \sum_i w(x_{0:t}^{(i)})}
= \sum_i h_t(x_{0:t}) \tilde{w}_t^{(i)}
where
\tilde{w}_t^{(i)} \equiv
\frac{w(x_{0:t})}{\sum_j w(x_{0:t}^{(j)})}
are the normalized importance weights.
This procedure can be interpreted as a sampling method with an approximate posterior given by
\hat P_N(dx_{0:t} \vert y_{1:t}) = \sum_i w_t^{(i)} \delta_{x_{0:t}}(dx_{0:t}).
We can then approximate by
\hat I(h_t) = \int h_t(x_{0:t}) \hat P_N(dx_{0:t} \vert y_{1:t}).
Although this solves the problem of being able to obtain draws, this approach is still not well-suited for our recursive problem.
Sequential Importance Sampling (SIS)
The goal of Sequential Importance Sampling (SIS) is to compute an estimate of using the past simulated values . We have the following recursive representation of the importance sampling distribution:
\begin{aligned}
\pi(x_{0:t} \vert y_{1:t})
&= \pi(x_0) \prod_{k=1}^t \pi(x_k \vert x_{0:k-1}, y_{0:k}).
\end{aligned}
This recursive relationship allows us to calculate the importance weights recursively:
\begin{aligned}
w(x_{0:t}) &\equiv \frac{p(x_{0:t} \vert y_{1:t})}{\pi(x_{0:t} \vert y_{1:t})} \\
&= \frac{ p(x_{0:t}, y_{1:t}) }
{ p(y_{1:t}) \pi(x_{0:t-1} \vert y_{1:t-1})
\pi(x_t \vert x_{0:t-1}, y_{1:t}) } \\
&= \frac{ p(x_{0:t-1}, y_{1:t-1})
p(x_t, y_t \vert x_{0:t-1}, y_{1:t-1}) }
{ p(y_t \vert y_{1:t-1})p(y_{1:t-1})
\pi(x_{0:t-1} \vert y_{1:t-1})
\pi(x_t \vert x_{0:t-1}, y_{1:t}) } \\
&= \frac{ p(x_{0:t-1} \vert y_{1:t-1}) }
{ \pi(x_{0:t-1} \vert y_{1:t-1}) }
\frac{ p(y_t \vert x_t) p(x_t \vert x_{0:t-1}) }
{ p(y_t \vert y_{1:t-1}) \pi(x_t \vert x_{0:t-1}, y_{1:t}) } \\
&= w(x_{0:t-1}) \frac{ p(y_t \vert x_t) p(x_t \vert x_{0:t-1}) }
{ p(y_t \vert y_{1:t-1}) \pi(x_t \vert x_{0:t-1}, y_{1:t}) }. \\
\end{aligned}
Hence,
\begin{aligned}
\tilde{w}(x_{0:t}^{(i)}) &= \frac{w(x_{0:t}^{(i)}}{\sum_j w(x_{0:t}^{(j)}} \\
&= \frac{
w(x_{0:t}^{(i)})
\frac{
p(y_t \vert x_t^{(i)}) p(x_t^{(i)} \vert x_{0:t-1}^{(i)})
}{
p(y_{1:t} \vert y_{1:t-1})
\pi(x_t^{(i)} \vert x_{0:t-1}^{(i)}, y_{1:t})
}
}{
\sum_j
w(x_{0:t}^{(j)})
\frac{
p(y_t \vert x_t^{(j)}) p(x_t^{(j)} \vert x_{0:t-1}^{(j)})
}{
p(y_{1:t} \vert y_{1:t-1})
\pi(x_t^{(j)} \vert x_{0:t-1}^{(j)}, y_{1:t})
}
}.
\end{aligned}
Notice that the term is constant for all and cancels out.
A special case occurs when
\pi(x_{0:t} \vert y_{1:t}) = p(x_{0:t})
= p(x_0) \prod_{k=1}^k p(x_k \vert x_{k-1})
so that
w(x_{0:t}) \propto w(x_{0:t-1}) p(y_t \vert x_t)
and
\tilde{w}_t^{(i)} \propto w_{t-1}^{(i)} p(y_t \vert x_t^{(i)}).
SIS is usually inefficient for high-dimensional integrals because as , the importance weights for some particles quickly approaches zero.
The Bootstrap Filter
To prevent particle degeneracy, the bootstrap filter introduces a resampling step which eliminates particles with low importance weights. Instead of using the importance-weighted empirical distribution we use a uniformly-weighted distribution
P_N(dx_{0:t} \vert y_{1:t}) = N^{-1} \sum_i N_t^{(i)}
\delta_{x_{0:t}^{(i)}}(dx_{0:t}),
where is the number of offspring of the particle which is to be determined by some branching mechanism. The most common such mechanism involves resampling times from (Gordon et al., 1993). We require for all . If , the particle dies. We want to choose such that
\int h_t(x_{0:t}) P_N(dx_{0:t} \vert y_{1:t}) \approx \int h_t(x_{0:t})
\hat{P}_N(dx_{0:t} \vert y_{1:t}).
The surviving particles, those with , are approximately distributed according to .
Algorithm
- Initialization ().
- For , draw .
- Set .
- Importance Sampling Step
-
For , draw and set .
-
For , calculate the importance weights and normalize them.
- Resampling Step