Metropolis Hastings Sampling is a method for obtaining samples from a known target probability distribution using samples from some other proposal distribution. It is similar to Rejection Sampling in providing a criterion for accepting a proposal sample as a target sample, but instead of discarding the samples that fail the acceptance criterion, the sample from the previous time step is replicated. Another difference is that Metropolis Hastings samples are modeled as a Markov Chain whose equilibrium distribution is the target distribution. As a consequence the previous sample is used as part of the acceptance criterion when generating the next sample. It will be seen that this permits adjustment of some proposal distribution parameters as each sample is generated, which in effect eliminates parameter inputs.
This is an improvement over Rejection Sampling, where it was previously shown that slight variations in proposal distribution parameters can significantly impact performance. A downside of the Markov Chain representation is that autocorrelation can develop in the samples, which is not the case in Rejection Sampling.

Algorithm

The Metropolis Sampling Algorithm was developed by physicists studying equation of state simulations at the dawn of the computer age. It is the first of a class of sampling algorithms referred to as Markov Chain Monte Carlo. A couple of decades later the algorithm was generalized and given a firmer theoretical foundation, becoming the Metropolis Hastings Sampling Algorithm. After a few more decades this work became widely used in simulating Bayesian Posterior Distributions in Probabilistic Programming DSLs such as pymc. This section provides an overview of the algorithm to motivate the details of the theory and examples discussed in the following sections.

Let $f(y)$ denote the target distribution. The algorithm constructs a Markov Chain with equilibrium distribution, $\pi_{E}(y)$, such that,

$$ f(y) = \pi_{E}(y). $$

The equilibrium stochastic kernel, $p(x,y)$, for the Markov Chain must satisfy,

$$ \pi_{E}(y) = \int_{-\infty}^{\infty} p(x, y) \pi_{E}(x)\ dx\ \ \ \ \ (1). $$

To generate samples for $p(x, y)$, samples from a proposal Markov Chain with stochastic kernel $q(x, y)$ are produced. Let $x$ denote the state of the $p(x, y)$ Markov Chain at the previous time step, $t-1$, and $y$ the yet to be determined state at time $t$. Consider a proposal sample, $y^{\ast}$, generated by $q(x, y)$. Define an acceptance function, $\alpha(x, y^{\ast})$, that satisfies,

$$ 0\ \leq\ \alpha(x, y^{\ast})\ \leq 1\ \ \ \ \ (2), $$

and an acceptance random variable, $U$, with distribution $\textbf{Uniform}(0,\ 1)$ independent of $y^{\ast}$. Now, if the following inequality is satisfied,

$$ U\ \leq\ \alpha(x, y^{\ast}), $$

accept the proposed sample, $y=y^{\ast}$, as a sample of $p(x, y)$. If the inequality is not satisfied then reject the sample by replicating the state from the previous time step, $y=x$.

It will be shown that,

$$ \alpha(x, y) = \text{min}\left\{\frac{f(y)q(y,x)}{f(x)q(x,y)},\ 1\right\}\ \ \ \ \ (3). $$

The algorithm can be summarized by the following steps that are repeated for each sample; a minimal code sketch follows the list.

  1. Generate a sample of $y^{\ast}$ conditioned on $x$ and independently generate a sample of $U$.
  2. If $U\ \leq\ \alpha(x, y^{\ast})$ then $y = y^{\ast}$.
  3. If $U\ >\ \alpha(x, y^{\ast})$ then $y = x$.
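These steps translate directly into code. Below is a minimal sketch of a single step, assuming a target density f, a proposal kernel density q and a proposal sampler qsample; the function and argument names here are illustrative, and a complete implementation is given in the Implementation section.

import numpy

def metropolis_hastings_step(f, q, qsample, x):
    # step 1: propose y* conditioned on x and independently draw U ~ Uniform(0, 1)
    y_star = qsample(x)
    u = numpy.random.rand()
    # acceptance function from equation (3)
    alpha = min(f(y_star) * q(y_star, x) / (f(x) * q(x, y_star)), 1.0)
    # step 2: accept the proposal; step 3: otherwise replicate the previous state
    return y_star if u <= alpha else x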

Theory

Metropolis Hastings Sampling is simple to implement but the theory that validates the algorithm is more complicated than that required for both Rejection Sampling and Inverse CDF Sampling. Here the proof that the algorithm produces the desired result is accomplished in four steps. The first will show that distributions and stochastic kernels satisfying Time Reversal Symmetry, also referred to as Detailed Balance, are equilibrium solutions that satisfy equation $(1)$. Next, the stochastic kernel resulting from the algorithm is derived, followed by a proof that a distribution satisfying Time Reversal Symmetry with the resulting stochastic kernel is a solution to equation $(1)$. Finally, the expression for $\alpha(x,y)$ from equation $(3)$ will be derived.

Time Reversal Symmetry

Consider a stochastic kernel $p(x,y)$ and distribution $\pi(y)$, where $x$ is the state of the Markov Chain generated by the kernel at $t-1$ and $y$ the state at time $t$. If the chain has $N$ steps the expected number of transitions from $x$ to $y$ will satisfy,

$$ N_{xy} = N \pi(x) p(x, y). $$

It follows that the rate of transitions from $x$ to $y$ is given by $\pi(x) p(x,y)$. The transition rate for the Markov Chain obtained by reversing time, that is the chain that transitions from $y$ to $x$, will be $\pi(y) p(y, x)$.

Time Reversal Symmetry implies that the transition rate for the time reversed Markov Chain is the same as that for the forward time chain,

$$ \pi(x) p(x, y) = \pi(y) p(y,x). $$

If Time Reversal Symmetry is assumed it follows that,

$$ \begin{aligned} \int_{-\infty}^{\infty} p(x, y) \pi(x)\ dx &= \int_{-\infty}^{\infty} p(y, x) \pi(y)\ dx \\ &= \pi(y) \int_{-\infty}^{\infty} p(y, x)\ dx \\ &= \pi(y), \end{aligned} $$

where the last step is a result of the definition of a stochastic kernel, $\int_{-\infty}^{\infty} p(y, x)\ dx = 1$. Thus any distribution and kernel satisfying the Time Reversal Symmetry condition is an equilibrium solution, since it will also satisfy equation $(1)$.

Stochastic Kernel

In a previous post it was shown that a stochastic kernel can be interpreted as a probability density conditioned on the state at the previous time step,

$$ p(x,y) = f(y \mid x). $$

It follows that the Cumulative Distribution Function (CDF) for obtaining a sample $y$ given $x$ is,

$$ \begin{aligned} P\left[ Y\ \leq\ y\ \mid\ X=x \right] &= \int_{-\infty}^{y} f(w \mid x)\ dw \\ &= \int_{-\infty}^{y} p(x,w)\ dw. \end{aligned} $$

Let $U$ represent the $\textbf{Uniform}(0, 1)$ acceptance random variable. A proposal sample, $y^{\ast}$ conditioned on $x$, is generated by $q(x,y)$ independent of $U$ and accepted if,

$$ U\ \leq\ \alpha(x, y^{\ast}), $$

in which case $y=y^{\ast}$. The proposal is rejected if,

$$ U\ >\ \alpha(x, y^{\ast}), $$

in which case $y=x$. Now, using the Law of Total Probability to condition on $U$, the stochastic kernel CDF becomes,

$$ \begin{aligned} P\left[ Y\ \leq\ y\ \mid\ X=x \right] =\ & P\left[Y\ \leq\ y\ \mid\ U\ \leq\ \alpha(x, y^{\ast}),\ X=x \right] P\left[U\ \leq\ \alpha(x, y^{\ast})\ \mid\ X=x \right] \\ &+ P\left[Y\ \leq\ y\ \mid\ U\ >\ \alpha(x, y^{\ast}),\ X=x \right] P\left[U\ >\ \alpha(x, y^{\ast})\ \mid\ X=x \right]\ \ \ \ \ (4). \end{aligned} $$

The first term is the probability that the proposed sample is accepted and the second term is the probability that it is rejected. Consider the first term, where $U\ \leq\ \alpha(x, y^{\ast})$ and $y=y^{\ast}$,

$$ P\left[Y\ \leq\ y,\ U\ \leq\ \alpha(x, y^{\ast})\ \mid\ X=x \right] = P\left[Y\ \leq\ y\ \mid\ U\ \leq\ \alpha(x, y^{\ast}),\ X=x \right] P\left[U\ \leq\ \alpha(x, y^{\ast})\ \mid\ X=x \right]\ \ \ \ \ (5), $$

which follows from the definition of Conditional Probability.

The algorithm assumes that $U$ is independent of $y^{\ast}$ conditioned on $x$. It follows that the joint density is given by,

$$ \begin{aligned} h(y^{\ast}, u \mid x) &= g(u) f(y^{\ast} \mid x) \\ &= q(x, y^{\ast}), \end{aligned} $$

where the last step follows by noting that since $U\ \sim\ \textbf{Uniform}(0,1)$, the density of $U$ is given by $g(u)=1$, and the density of $y^{\ast}$ conditioned on $x$ is given by the proposal stochastic kernel, $f(y^{\ast} \mid x)=q(x, y^{\ast})$. Using this result, the acceptance probability term of the stochastic kernel CDF from equation $(5)$ becomes,

$$ \begin{aligned} P\left[Y\ \leq\ y,\ U\ \leq\ \alpha(x, y^{\ast})\ \mid\ X=x \right] &= \int_{-\infty}^{y}\int_{0}^{\alpha(x, y^{\ast})} h(y^{\ast}, u \mid x)\ du\ dy^{\ast} \\ &= \int_{-\infty}^{y}\int_{0}^{\alpha(x, y^{\ast})} q(x, y^{\ast})\ du\ dy^{\ast} \\ &= \int_{-\infty}^{y} \alpha(x, y^{\ast}) q(x, y^{\ast})\ dy^{\ast}\ \ \ \ \ (6). \end{aligned} $$

Next, consider the second term in equation $(4)$, which represents the case where $y^{\ast}$ is rejected as a sample of $y$. Since $y^{\ast}$ is rejected, $y=x$. It follows that any value of $y^{\ast}$ is possible, so the integral over $y^{\ast}$ must be over its entire range. Using the previous result for the joint density, $h(y^{\ast}, u \mid x)$, gives,

$$ \begin{aligned} P\left[Y = x,\ U\ >\ \alpha(x, y^{\ast})\ \mid\ X=x \right] &= \int_{-\infty}^{\infty}\int_{\alpha(x, y^{\ast})}^{1} h(y^{\ast}, u \mid x)\ du\ dy^{\ast} \\ &= \int_{-\infty}^{\infty}\int_{\alpha(x, y^{\ast})}^{1} q(x, y^{\ast})\ du\ dy^{\ast} \\ &= \int_{-\infty}^{\infty}\left[ 1-\alpha(x, y^{\ast})\right] q(x, y^{\ast})\ dy^{\ast} \\ &= \int_{-\infty}^{\infty} q(x, y^{\ast})\ dy^{\ast} - \int_{-\infty}^{\infty}\alpha(x, y^{\ast}) q(x, y^{\ast})\ dy^{\ast} \\ &= 1 - \int_{-\infty}^{\infty}\alpha(x, y^{\ast}) q(x, y^{\ast})\ dy^{\ast}\ \ \ \ \ (7), \end{aligned} $$

since $\int_{-\infty}^{\infty} q(x, y^{\ast})\ dy^{\ast}=1$. Things are simplified by defining the rejection probability,

$$ f_{R}(x) = 1 - \int_{-\infty}^{\infty}\alpha(x, y) q(x, y)\ dy\ \ \ \ \ (8). $$

Equation $(7)$ needs to be written as an integral over $y^{\ast}$ with upper limit $y$ to continue with the evaluation of equation $(4)$. This is accomplished by use of the Dirac Delta Function, which is defined by,

$$ \int_{-\infty}^{\infty} f(y^{\ast})\delta(y^{\ast}-x)\ dy^{\ast} = f(x). $$

If use is made of equation $(8)$ and the Dirac Delta Function, it follows that equation $(7)$ becomes,

$$ P\left[Y\ \leq\ y,\ U\ >\ \alpha(x, y^{\ast})\ \mid\ X=x \right] = \int_{-\infty}^{y} f_{R}(y^{\ast})\delta(y^{\ast}-x)\ dy^{\ast}\ \ \ \ \ (9). $$

The upper limit of integration was changed to conform to the limits of the CDF. This is acceptable since if $y\ <\ x$ the probability of rejection is $0$. The Metropolis Hastings stochastic kernel can now be constructed by assembling the results obtained in equations $(6)$ and $(9)$ and revisiting equation $(4)$,

$$ \begin{aligned} \int_{-\infty}^{y} p(x,w)\ dw &= P\left[ Y\ \leq\ y\ \mid\ X=x \right] \\ &= P\left[Y\ \leq\ y,\ U\ \leq\ \alpha(x, y^{\ast})\ \mid\ X=x \right] + P\left[Y\ \leq\ y,\ U\ >\ \alpha(x, y^{\ast})\ \mid\ X=x \right] \\ &= \int_{-\infty}^{y} \alpha(x, y^{\ast}) q(x, y^{\ast})\ dy^{\ast} + \int_{-\infty}^{y} f_{R}(y^{\ast})\delta(y^{\ast}-x)\ dy^{\ast} \\ &= \int_{-\infty}^{y} \left[ \alpha(x, y^{\ast}) q(x, y^{\ast}) + f_{R}(y^{\ast})\delta(y^{\ast}-x)\right]\ dy^{\ast}. \end{aligned} $$

Finally, the desired result is obtained by equating the integrands,

$$ \begin{gathered} p(x,y) = \alpha(x, y) q(x, y) + f_{R}(y)\delta(y-x) \\ f_{R}(x) = 1 - \int_{-\infty}^{\infty}\alpha(x, y) q(x, y)\ dy \end{gathered}\ \ \ \ \ (10). $$

Verify that the kernel satisfies the stochastic property,

$$ \begin{aligned} \int_{-\infty}^{\infty} p(x,y)\ dy &= \int_{-\infty}^{\infty}\left[ \alpha(x, y) q(x, y) + f_{R}(y)\delta(y-x)\right]\ dy \\ &= \int_{-\infty}^{\infty}\alpha(x, y) q(x, y)\ dy + f_{R}(x) \\ &= \int_{-\infty}^{\infty}\alpha(x, y) q(x, y)\ dy + 1 - \int_{-\infty}^{\infty}\alpha(x, y) q(x, y)\ dy \\ &= 1, \end{aligned} $$

which is the desired result.

Equilibrium

It can now be shown that the Metropolis Hastings stochastic kernel from equation $(10)$ is an equilibrium solution by evaluating the integral from equation $(1)$ that defines equilibrium. Additionally, Time Reversal Symmetry must be assumed,

$$ \pi(x)\alpha(x,y)q(x,y) = \pi(y)\alpha(y,x)q(y,x)\ \ \ \ \ (11), $$

where $\alpha(x,y)$ is the acceptance function from equation $(3)$, $q(x,y)$ is the stochastic kernel for the proposal Markov Chain, $\pi(y)$ is the distribution, $y$ is the state of the target Markov Chain at time step $t$ and $x$ the state at time $t-1$.

Now, starting with equation $(1)$,

$$ \begin{aligned} \int_{-\infty}^{\infty} p(x, y) \pi(x)\ dx &= \int_{-\infty}^{\infty}\left[ \alpha(x, y) q(x, y) + f_{R}(y)\delta(y-x)\right] \pi(x)\ dx \\ &= \int_{-\infty}^{\infty}\alpha(x, y) q(x, y) \pi(x)\ dx + f_{R}(y)\pi(y) \\ &= \int_{-\infty}^{\infty}\alpha(x, y) q(x, y) \pi(x)\ dx + \left[1 - \int_{-\infty}^{\infty}\alpha(y, w) q(y, w)\ dw \right]\pi(y) \\ &= \pi(y) + \int_{-\infty}^{\infty}\alpha(x, y) q(x, y) \pi(x)\ dx - \int_{-\infty}^{\infty}\pi(y)\alpha(y, w) q(y, w)\ dw \\ &= \pi(y) + \int_{-\infty}^{\infty}\alpha(x, y) q(x, y) \pi(x)\ dx - \int_{-\infty}^{\infty}\pi(w)\alpha(w, y) q(w, y)\ dw \\ &= \pi(y), \end{aligned} $$

where the second to last step follows from substituting equation $(11)$ into the last integral, leading to the desired result.

Derivation of $\alpha(x,y)$

It has been shown that the Metropolis Hastings algorithm generates a Markov Chain with an equilibrium distribution equal to the target distribution. This was accomplished by only requiring that the acceptance function, $\alpha(x,y)$, satisfy Time Reversal Symmetry as specified by equation $(11)$, without giving an explicit form. Here an expression for $\alpha(x,y)$ is derived with the aim of driving arbitrary proposal Markov Chains toward states satisfying Time Reversal Symmetry.

For an arbitrary proposal stochastic kernel, $q(x,y)$, Time Reversal Symmetry will not be satisfied, so either $\pi(x)q(x,y)\ >\ \pi(y)q(y,x)$ or $\pi(x)q(x,y)\ <\ \pi(y)q(y,x)$ will be true. Assume the first condition is valid. In this case transitions from $x$ to $y$ occur more frequently than transitions from $y$ to $x$. To correct for the imbalance the number of transitions from $x$ to $y$ needs to be decreased and the number of transitions from $y$ to $x$ increased. This can be accomplished by setting $\alpha(y,x)=1$ in equation $(11)$ to obtain,

$$ \pi(x)\alpha(x,y)q(x,y) = \pi(y)q(y,x). $$

Solving this equation for $\alpha(x,y)$ gives,

$$ \alpha(x,y) = \frac{\pi(y)q(y,x)}{\pi(x)q(x,y)}. $$

Similarly, if the second condition, $\pi(x)q(x,y)\ <\ \pi(y)q(y,x)$, is satisfied by the proposal stochastic kernel, the number of transitions from $x$ to $y$ needs to be increased and the transitions from $y$ to $x$ decreased. Setting $\alpha(x,y)=1$ in equation $(11)$ produces the desired result,

$$ \pi(x)q(x,y) = \pi(y)\alpha(y,x)q(y,x). $$

Solving for $\alpha(y,x)$ gives,

$$ \alpha(y,x) = \frac{\pi(x)q(x,y)}{\pi(y)q(y,x)}, $$

which is the time reversed version of the first result. Equation $(3)$ follows when the constraint, $0\ \leq\ \alpha(x, y^{\ast})\ \leq 1$, is considered,

$$ \alpha(x, y) = \text{min}\left\{\frac{f(y)q(y,x)}{f(x)q(x,y)},\ 1\right\}. $$
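As a quick numerical sanity check, the sketch below verifies that this form of $\alpha$ restores Time Reversal Symmetry, equation $(11)$, for a pair of states. The target density and deliberately asymmetric proposal kernel used here are illustrative choices, not part of the algorithm.

import numpy

def alpha(pi, q, x, y):
    # acceptance function from equation (3) for target density pi
    return min((pi(y) * q(y, x)) / (pi(x) * q(x, y)), 1.0)

# illustrative target density (Laplace) and asymmetric Normal proposal kernel
pi = lambda x: numpy.exp(-abs(x)) / 2.0
q = lambda x, y: numpy.exp(-(y - x - 0.5)**2 / 2.0) / numpy.sqrt(2.0 * numpy.pi)

x, y = 0.3, 1.7
lhs = pi(x) * alpha(pi, q, x, y) * q(x, y)
rhs = pi(y) * alpha(pi, q, y, x) * q(y, x)
assert numpy.isclose(lhs, rhs)  # detailed balance, equation (11), is satisfied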

Example

Here an example implementation of the Metropolis Hastings Sampling algorithm using a $\textbf{Weibull}$ target distribution and a $\textbf{Normal}$ proposal distribution is discussed. The $\textbf{Weibull}$ distribution PDF is defined by,

$$ f_X(x; k, \lambda) = \begin{cases} \frac{k}{\lambda}\left(\frac{x}{\lambda} \right)^{k-1} e^{-\left(x/\lambda\right)^k} & x\ \geq\ 0 \\ 0 & x < 0 \end{cases}\ \ \ \ \ (12), $$

where $k\ >\ 0$ is the shape parameter and $\lambda\ >\ 0$ the scale parameter. The mean and variance are,

$$ \begin{aligned} \mu &= \lambda\Gamma\left(1+\frac{1}{k}\right) \\ \sigma^2 &= \lambda^2\left[\Gamma\left(1+\frac{2}{k}\right)-\left(\Gamma\left(1+\frac{1}{k}\right)\right)^2\right] \end{aligned}\ \ \ \ \ (13), $$

where $\Gamma(x)$ is the Gamma function. The following plot illustrates the variation in the $\textbf{Weibull}$ distribution for a fixed scale parameter, $\lambda=1$, as the shape parameter, $k$, is varied. The following sections will assume $\lambda=1$ and $k=5$.
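Equation $(13)$ is straightforward to evaluate numerically. A sketch using the scipy Gamma function with the assumed $\lambda=1$ and $k=5$ follows; it produces the target mean and standard deviation referenced in the simulation discussions below.

import numpy
from scipy.special import gamma

k, lam = 5.0, 1.0
mu = lam * gamma(1.0 + 1.0 / k)                                           # target mean
sigma = lam * numpy.sqrt(gamma(1.0 + 2.0 / k) - gamma(1.0 + 1.0 / k)**2)  # target standard deviation
print(mu, sigma)  # approximately 0.918 and 0.210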

The Normal distribution is defined by the PDF,

$$ f_X(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\ e^{-(x-\mu)^2/2\sigma^2}\ \ \ \ \ (14), $$

where $\mu$ is the location parameter and mean, and $\sigma^2$ is the scale parameter and variance. The following plot illustrates variation in the PDF as the scale and location parameters are varied.

Proposal Distribution Stochastic Kernel

The Metropolis Hastings Sampling algorithm requires a stochastic kernel based on the proposal distribution. A $\textbf{Normal}$ stochastic kernel can be derived from the difference equation,

$$ X_{t} = X_{t-1} + \varepsilon_{t}, $$

where $t=0,\ 1,\ 2,\ldots$ and the $\varepsilon_{t}$ are independent, identically distributed $\textbf{Normal}$ random variables with zero mean and variance $\sigma^2$. Let $y=X_{t}$ and $x=X_{t-1}$; the equation above then becomes,

$$ y - x = \varepsilon_{t}. $$

Substitution into equation $(14)$ gives the stochastic kernel,

$$ q(x,y) = \frac{1}{\sqrt{2\pi\sigma^2}}\ e^{-(y-x)^2/2\sigma^2}\ \ \ \ \ (15). $$

The form of $q(x,y)$ is a $\textbf{Normal}$ distribution with mean $\mu=x$. This eliminates $\mu$ as a free parameter. The only arbitrary parameter is the standard deviation, $\sigma$, which in the implementation will be referred to as the stepsize. The plot below shows how $q(x,y)$ varies with each step in a simulation. The first five steps are shown for an initial condition $X_{0}=0$ and stepsize=1.0.

In the plot it is seen that the distribution is recentered at each step about the previous value. This behavior is a consequence of assuming a $\textbf{Normal}$ proposal distribution. A different choice of proposal distribution could lead to a different parameterization.

A Rejection Sampling implementation using the same proposal and target distributions has both $\mu$ and $\sigma$ as arbitrary parameters. Modeling the sampling process as a Markov Chain has eliminated one parameter.

Implementation

Metropolis Hastings Sampling is simple to implement. This section describes an implementation in Python using numpy. Below, a sampler for $q(x,y)$ using the numpy normal random number generator is shown.

import numpy

def qsample(x, stepsize):
    # draw a proposal sample from a Normal with mean x and standard deviation stepsize
    return numpy.random.normal(x, stepsize)

qsample(x, stepsize) takes two arguments as input. The first is x, the previous state, and the second is the stepsize. x is used as the loc parameter and stepsize as the scale parameter in the call to the numpy normal random number generator.

Simulations of time series using qsample(x, stepsize) can be performed using the code below.

stepsize = 1.0
x0 = 0.0
nsamples = 500
x = numpy.zeros(nsamples)

x[0] = x0
for j in range(1, nsamples):
    x[j] = qsample(x[j-1], stepsize)

The following plot shows time series generated by three separate runs of the code listed above.

The Python implementation of the algorithm summarized in the Algorithm section is shown below,

def metropolis_hastings(f, q, qsample, stepsize, nsample, x0):
    x = x0
    samples = numpy.zeros(nsample)
    for i in range(nsample):
        u = numpy.random.rand()            # acceptance variable U ~ Uniform(0, 1)
        y_star = qsample(x, stepsize)      # proposal sample conditioned on x
        fy_star = f(y_star)
        fx = f(x)
        α = (fy_star*q(y_star, x, stepsize)) / (fx*q(x, y_star, stepsize))
        if u < α:
            x = y_star                     # accept the proposal; otherwise replicate x
        samples[i] = x
    return samples

metropolis_hastings(f, q, qsample, stepsize, nsample, x0) takes six arguments as input, which are described in the table below.

Argument Description
f The target distribution, assumed to support an interface taking a state value as a floating point argument. In this example equation $(12)$ is used.
q The proposal stochastic kernel, assumed to support an interface taking two state values and the stepsize as floating point arguments. In this example equation $(15)$ is used.
qsample A random number generator based on the proposal stochastic kernel, $q(x,y)$. It is assumed to support an interface taking the previous state, x, and the stepsize, both as floating point arguments. The implementation used in this example is the function qsample(x, stepsize) previously discussed.
stepsize The scale parameter used by the proposal distribution.
nsample The number of samples desired.
x0 The initial target sample value.

The execution of metropolis_hastings(f, q, qsample, stepsize, nsample, x0) begins by allocating storage for the result and initializing the Markov Chain. A loop is then executed nsample times, where each iteration generates the next sample. Within the loop the acceptance random variable with distribution $\textbf{Uniform}(0,\ 1)$ is generated using the numpy random number generator. Next, a proposal sample is generated using qsample(x, stepsize), followed by evaluation of the acceptance function $\alpha(x,y)$. Finally, the acceptance criterion is evaluated, leading to the sample being accepted or rejected.
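A short usage sketch under these assumptions follows, with f implemented from equation $(12)$ using $\lambda=1$ and $k=5$, and q implemented from equation $(15)$; the names match the table above. Note that this q is symmetric in its two state arguments, so the ratio $q(y^{\ast},x)/q(x,y^{\ast})$ cancels in $\alpha$, but the general form is retained.

import numpy

def f(x):
    # Weibull target PDF, equation (12), with k=5 and lambda=1
    k, lam = 5.0, 1.0
    if x < 0.0:
        return 0.0
    return (k / lam) * (x / lam)**(k - 1) * numpy.exp(-(x / lam)**k)

def q(x, y, stepsize):
    # Normal proposal stochastic kernel, equation (15)
    return numpy.exp(-(y - x)**2 / (2.0 * stepsize**2)) / numpy.sqrt(2.0 * numpy.pi * stepsize**2)

samples = metropolis_hastings(f, q, qsample, stepsize=0.12, nsample=100000, x0=1.0)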

Simulations

Here the results of simulations performed using,

 metropolis_hastings(f, q, qsample, stepsize, nsample, x0),

from the previous section are discussed. The simulations assume the $\textbf{Weibull}$ target distribution from equation $(12)$ with $\lambda=1$ and $k=5$ and the $\textbf{Normal}$ stochastic kernel from equation $(15)$. First, simulations that scan four orders of magnitude of stepsize with x0 fixed are compared, with the goal of determining the value leading to the best performance. Next, simulations are performed with stepsize fixed that scan values of x0 to determine the impact of the initial condition on performance. Finally, a method for removing autocorrelation from generated samples using Thinning is discussed.

The criteria for evaluating simulations include the time for convergence of the first and second moments computed from samples to the target distribution moments, the fit of the sampled distribution to the target PDF, the percentage of accepted proposal samples and the timescale for the decay of time series autocorrelation. It will be shown that stepsize is the only relevant parameter. The choice of x0 will be seen to have an impact that vanishes given sufficient time. Thus, the model has a single parameter of significance.

Performance

The plot below shows the percentage of proposal samples accepted as a function of stepsize for simulations with stepsize ranging from $10^{-3}$ to $10^{1}$. For all simulations x0=1.0. The simulation believed to be the best performing accepted $82\%$ of the proposed samples and had a stepsize of $0.12$. The two other simulations discussed in this section, representative of small and large values of stepsize, had acceptance percentages of $99\%$ and $19\%$ for stepsize values of $0.01$ and $1.33$ respectively. Each of these simulations is indicated in the plot below.
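Since the proposal distribution is continuous, a repeated value in the output almost surely indicates a rejection, so the acceptance percentage can be estimated directly from the returned samples. A minimal sketch:

def acceptance_percentage(samples):
    # fraction of steps where the chain moved, i.e. the proposal was accepted
    moved = samples[1:] != samples[:-1]
    return 100.0 * moved.mean()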

On examination of the plot it is seen that for stepsize smaller than $0.12$, the simulations to the left of the orange symbol, the accepted percentage of proposal samples very quickly approaches $100\%$, while for stepsize larger than $0.12$, the simulations to the right of the orange symbol, the accepted percentage approaches $0\%$ as a power law.

To get a sense of why this happens the stepsize is compared to the standard deviation of the target distribution, which, computed from equation $(13)$ using the assumed values of $\lambda$ and $k$, is $0.21$. This value is the same order of magnitude as the best performing stepsize determined from simulations. Comparing the stepsize to the target standard deviation in the large and small limits provides an interpretation of the results. For stepsize small relative to the target standard deviation, the proposal variance is much smaller than the target variance. Because of this the steps taken by the proposal Markov Chain are small, leading to an exploration of the target distribution that never goes far from the initial value. The samples produced will follow the proposal distribution since nearly all proposals are accepted. This is seen in the first of the following plots, where time series for different stepsize values are shown for the last $500$ steps of a $51000$ sample simulation. The small variance seen in the first plot is a consequence of the small stepsize leading to very small deviations from the initial value x0=1. In the opposite limit, where the stepsize becomes large relative to the target standard deviation, the steps taken by the proposal Markov Chain are so large that they are frequently rejected since low probability target events are oversampled. This effect is seen in the last time series plot shown below, where long runs in which the series has a constant value are clearly visible. The best value of stepsize relative to the target standard deviation occurs when they are about the same size. The middle plot below has the optimal stepsize of $0.12$ and accepted $82\%$ of the proposed samples. It clearly has a more authentic look than the other two simulations, which are at extreme values of the percentage accepted.

A measure of simulation performance is the rate of convergence of distribution moments computed from samples to the target distribution moments. The next two plots show convergence of the cumulative sample mean, $\mu$, and standard deviation, $\sigma$, to the target distribution values computed from equation $(13)$ for three values of stepsize, comparing simulations at the small and large extremes with the optimal stepsize of $0.12$. For both $\mu$ and $\sigma$ the small stepsize example is clearly the worst performing. After $10^5$ time steps there is no indication of convergence. There is no significant difference between the other two simulations. Both show convergence after $O(10^4)$ samples.
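The cumulative moments shown in these plots can be computed with running sums. A sketch:

import numpy

def cumulative_mean_std(samples):
    # cumulative mean and standard deviation after each time step
    n = numpy.arange(1, len(samples) + 1)
    mean = numpy.cumsum(samples) / n
    var = numpy.cumsum(samples**2) / n - mean**2
    return mean, numpy.sqrt(var)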

A consequence of using a Markov Chain to generate samples is that autocorrelation is built into the model. This is considered an undesirable artifact since in general independent and identically distributed samples are desired. It follows that there is a preference for rapid decorrelation of samples. The following plot shows the autocorrelation coefficient for the three values of stepsize previously discussed. The autocorrelation coefficient of a time series, $f$, provides a measure of the dependence of the series on its past and is defined by,

$$ \gamma_{\tau} = \frac{1}{N\sigma_{f}^2}\sum_{n=0}^{N-\tau-1} \left(f_{n} - \mu_f \right) \left(f_{n+\tau} - \mu_f\right), $$

where $N$ is the number of samples and $\mu_{f}$ and $\sigma_{f}$ are the time series mean and standard deviation respectively. Calculation of autocorrelation was discussed in a previous post. The small stepsize simulation has a very slowly decaying autocorrelation. For time lags of up to 100 steps it has decayed by only a few percent. This behavior is expected since for small stepsize the resulting samples are from the proposal Markov Chain, with the stochastic kernel shown in equation $(15)$, which has an autocorrelation coefficient independent of $\tau$ given by $\gamma_{\tau}=1$. The other simulations have a similar decorrelation rate of $O(10)$ time steps, though for the larger stepsize the decorrelation is slightly faster.
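A direct numpy sketch of this calculation for a range of time lags:

import numpy

def autocorrelation(samples, max_lag):
    # autocorrelation coefficient for lags 0 through max_lag - 1
    n = len(samples)
    mu, sigma = samples.mean(), samples.std()
    centered = samples - mu
    return numpy.array([(centered[:n - tau] * centered[tau:]).sum() / (n * sigma**2)
                        for tau in range(max_lag)])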

The following three plots compare histograms computed from simulations of $10^5$ samples with the target $\textbf{Weibull}$ distribution from equation $(12)$ for the same stepsize values used in the previous comparisons. The first plot shows the small stepsize simulation, which has a very high acceptance rate. For this case the simulated histogram does not come close to reproducing the target distribution. The last plot is the large stepsize simulation, which has a very large rejection rate. When compared with the optimal stepsize $0.12$ simulation shown in the middle plot, the larger stepsize simulation is not as smooth but is acceptable. The degraded performance of the larger stepsize simulation is a consequence of the rejection of many more proposal samples leading to long stretches of repeated values. For the optimal stepsize simulation almost $10$ times more samples are available in the histogram calculation.

In summary, a collection of simulations using the Metropolis Hastings Sampling algorithm to sample a target $\textbf{Weibull}$ distribution using a $\textbf{Normal}$ proposal distribution was performed, scanning the stepsize parameter for a fixed initial value of x0=1.0 in an effort to determine the best performing stepsize. Best performing was determined by considering the total percentage of accepted samples, the quality of the generated time series, the convergence of the first and second moments to the known target values, the decorrelation timescale and the fit of sample histograms to the target distribution. The best performing stepsize was found to have a value of $0.12$, which is near the standard deviation of the target distribution. For values of stepsize smaller than the best performing stepsize the performance was inferior on all counts: the number of accepted proposals was high, the time series resembled the proposal distribution, convergence of moments was slow, samples remained correlated over long time scales and the distributions computed from samples did not converge to the target distribution. When the same comparison is made to simulations with a larger stepsize the results are less conclusive. Larger values of stepsize accept fewer proposal samples, which degrades the time series since there are many long runs of repeated values. In comparisons of convergence of distribution moments and autocorrelation there were no significant differences, but distributions calculated using histograms of sample data were not as acceptable since an order of magnitude less data was used in the calculation because of the high rejection percentage.

Burn In

In the previous section simulations were compared that scanned stepsize with x0 fixed. Here simulations with stepsize fixed that scan x0 are discussed. The further x0 is from the target distribution mean, the further displaced the initial state is from equilibrium, so the expectation is that a longer time is required to reach equilibrium. If some knowledge of the target mean is available it should be used to minimize this relaxation period. The time required to reach equilibrium is referred to as burn in. Typically the burn in period of the simulation is considered an artifact and discarded with the aim of improving the final result. The expectation then is that the remaining samples represent natural fluctuations about equilibrium.
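In code, removing burn in is a single slice of the sample array; the sketch below assumes the $10^4$ step burn in period used later in this section.

nburn = 10000
equilibrium_samples = samples[nburn:]  # discard the burn in period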

The two plots above compare cumulative first and second moment calculations on simulated samples with the target $\textbf{Weibull}$ distribution values calculated from equation $(13)$ using $\lambda = 1$ and $k=5$. The obtained mean and standard deviation are $0.92$ and $0.21$ respectively. In the plots the time to reach equilibrium is seen to increase with increasing displacement of x0 from $0.92$. For the simulation at the most extreme value of x0 the time to reach equilibrium is a factor of $10$ longer than for the simulation that reached equilibrium first. For the standard deviation the effect is even larger, increasing the relaxation time by nearly a factor of $10^2$ when compared to the fastest relaxation time seen. There is also a very large spike in the standard deviation that increases with displacement of x0 from the target mean.

The following plot of the autocorrelation coefficient shows that the time for the sampled time series to decorrelate is not impacted by the initial value of x0.

The next plot shows a comparison of the histogram computed from the worst performing simulation, with x0=3.5, with the target distribution. The over representation of large, low probability samples induced by the large displacement of the initial condition is apparent in the long right tail.

In the next series of plots a burn in period of $10^4$ time steps is assumed and that data is excluded from calculations. This corresponds to removing $20\%$ of the samples. The magnitude of the adjustment to equilibrium is significantly reduced. Within $10^3$ time steps all simulations for both moments are near their equilibrium values, with no significant difference in convergence rate. The clear increase in the time of convergence to equilibrium with displacement of x0 from the equilibrium value, seen when all samples are included, has been eliminated.

The final plot compares the sample histogram, with burn in removed, for the simulation with x0=3.5 to the target distribution. It is much improved over the plot in which burn in was retained. In fact it is comparable with the same plot obtained above for a simulation with x0=1.

Thinning

It is desirable that sampling algorithms not introduce correlations between samples, since random number generators for a given distribution are expected to produce independent and identically distributed samples. Since Metropolis Hastings Sampling models generated samples as a Markov Chain, autocorrelation is expected because the proposal stochastic kernel depends on the previously generated sample. Thinning is a method for reducing autocorrelation in a sequence of generated samples. It is a simple algorithm that can be understood by considering a time series, $\{x_{t}\}$, of $N$ samples. Thinning discards samples that do not satisfy $t\ \text{mod}\ \eta=0$, where $\eta$ is referred to as the thinning interval. Discarding all but equally spaced samples increases the time between the retained samples, which can reduce autocorrelation.
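In code, Thinning is a strided slice; the sketch below uses an illustrative thinning interval of $\eta=5$ on the burn in trimmed samples from the previous section.

eta = 5
thinned_samples = equilibrium_samples[::eta]  # keep every eta-th sample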

The impact of Thinning on the appearance of the time series is illustrated below. The first plot is from the original time series and the others illustrate increasing the thinning interval. The first point in each plot is the same and the plot range is adjusted so that all have $500$ samples. A tendency of series values to infrequently change from increasing to decreasing as time increases is an indication of autocorrelation. Comparison of the original series with the thinned series illustrates this. The plot with $\eta=1$ may have runs of $O(10)$ samples with a persistent increasing or decreasing trend, which is consistent with the decorrelation time of $O(10)$ previously demonstrated. For increasing $\eta$ the length of increasing or decreasing trends clearly decreases.

The following plot confirms this intuition. The simulation samples are generated using Metropolis Hastings with a $\textbf{Weibull}$ target distribution and a $\textbf{Normal}$ proposal distribution, with x0=1 and the previously determined best performing stepsize=0.12. The burn in period samples have also been discarded. A value of $\eta=1$ corresponds to not discarding any samples. It is seen that the decorrelation time scale decreases rapidly, with most of the reduction occurring by $\eta=4$.

Thinning has no impact on the rate of convergence of the first and second moments computed from samples to the target distribution values. This is shown in the following two plots, which have convergence rates of $O(10^3)$ samples, as seen in the previous section when only the burn in samples were discarded.

The following two plots compare histograms computed from samples with the target distribution for different values of the thinning interval. The first plot shows the original series and the second the result from thinned samples. The fit of the thinned plot is degraded, but this is due to the smaller number of samples. The original series consists of $4\times 10^4$ samples and the thinned series $6.6\times 10^3$. It follows that if fairly aggressive thinning is performed, simulations may need to be run $O(10)$ longer to obtain a sufficient set of samples.

The following plot shows the results of a simulation of $5\times10^5$ samples, which is comparable with the result obtained without thinning.

Conclusions

The Metropolis Hastings Sampling algorithm provides a general method for generating samples for a known target distribution using an available proposal sampler. The algorithm is similar to Rejection Sampling, but instead of discarding rejected proposal samples, Metropolis Hastings Sampling replicates the previous sample. It also has a more complicated acceptance function that models the generated samples as a Markov Chain. Metropolis Hastings Sampling is simple to implement, but the theory describing how it works is more sophisticated than that encountered in both Inverse CDF Sampling and Rejection Sampling.

The theoretical discussion began by deriving the stochastic kernel that follows from the assumptions introduced by the rejection algorithm, equation $(10)$. Next, it was shown that if Time Reversal Symmetry is assumed for the stochastic kernel and target distribution, equation $(11)$, then the target distribution is the equilibrium distribution of the Markov Chain created by the stochastic kernel. Finally, the acceptance function, equation $(3)$, was derived using Time Reversal Symmetry.

A Python implementation of a Metropolis Hastings Sampler was presented, followed by discussion of an example using a $\textbf{Weibull}$ distribution as target and a $\textbf{Normal}$ distribution as proposal. A parameterization of the proposal stochastic kernel using stepsize as the standard deviation of the proposal distribution was described. The results of a series of simulations that varied stepsize over several orders of magnitude for a fixed initial state, x0, were presented. It was shown that the best performing simulations have a stepsize that is about the same as the standard deviation of the target distribution. This was followed by simulations that scanned x0 for the stepsize determined as best. It was shown that the best performing simulations have x0 near the mean of the target distribution, but that if the first $10^4$ samples of the simulation are discarded, x0 can differ significantly from the target mean with no impact. Autocorrelation of samples is a consequence of using a Markov Chain to model samples and is considered an undesirable artifact since independent identically distributed samples are desired. Autocorrelation can be removed by Thinning the samples. It was shown that thinning intervals of about $\eta=5$ can produce samples with a very short decorrelation time at the expense of requiring generation of more samples.

The best performing simulations using Metropolis Hastings Sampling have an acceptance rate of about $80\%$ and converge to target values in about $O(10^3)$ time steps. This is the same performance obtained using Rejection Sampling with a $\textbf{Normal}$ proposal distribution. The advantage offered by Metropolis Hastings Sampling is that it has a single arbitrary parameter. Rejection Sampling has two arbitrary parameters that can significantly impact performance for small changes. It follows that Metropolis Hastings Sampling can be considered more robust.