
Bayesian analysis tutorial using PyMC

posted Jul 1, 2011, 2:54 AM by Olaf Bochmann
This is a short Bayesian analysis tutorial built around the following problem: consider a dataset of recorded coal mining disasters in the UK from 1851 to 1962 [Jarrett, 1979], given as a time series of yearly disaster counts.
The first step in the analysis is Bayesian model building. We assume that the occurrences of the disasters can be modelled as a Poisson process. Our hypothesis is that at some point in time the mining industry switched to a new technology, resulting in a lower rate of disasters in the later part of the time series. We define the following variables for the Bayesian stochastic model:
  • D_t: the number of disasters in year t.
  • r_t: the rate parameter of the Poisson distribution of disasters in year t.
  • s: the year in which the rate parameter changes (the switch point).
  • e: the early rate parameter, prior to s.
  • l: the late rate parameter, after s.
  • t_l, t_h: the lower and upper boundaries of year t.
  • r_e, r_l: the rate parameters of the priors on the early and late rates.
The model can then be defined as

  D_t ~ Poisson(r_t), with r_t = e if t < s and r_t = l if t >= s
  s ~ DiscreteUniform(t_l, t_h)
  e ~ Exponential(r_e)
  l ~ Exponential(r_l)

where D is dependent on s, e, and l. In a Bayesian network the probabilistic variables s, e, and l are 'parent' nodes of D, and D is a 'child' node of s, e, and l. Similarly, the 'parents' of s are t_l and t_h, and s is the 'child' of t_l and t_h.
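To make the model concrete, here is a plain-Python sketch (not the PyMC code itself) of the model's unnormalized joint log-probability; the function name and the default prior rates are illustrative assumptions:

```python
import math

def log_posterior(s, e, l, D, t_l, t_h, r_e=1.0, r_l=1.0):
    """Unnormalized log p(s, e, l | D) for the change-point model.

    D is the list of yearly disaster counts, indexed from year t_l.
    r_e and r_l are the exponential prior rates (illustrative values).
    """
    if not (t_l <= s <= t_h) or e <= 0 or l <= 0:
        return float("-inf")  # outside the prior support
    # Exponential priors on e and l; the uniform prior on s adds only a constant.
    lp = -r_e * e - r_l * l
    for t, d in enumerate(D, start=t_l):
        r = e if t < s else l  # the rate switches at the change point s
        lp += d * math.log(r) - r - math.lgamma(d + 1)  # Poisson log-pmf
    return lp
```

Summing the Poisson log-likelihood terms with the log-priors is all the sampler below needs; the normalizing constant of the posterior never has to be computed.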

The next step is fitting the probability model (a linked collection of probabilistic variables) to the recorded mining disaster time series. This means we are trying to represent the posterior distribution. A Markov chain Monte Carlo (MCMC) algorithm [Gamerman 1997] is the method of choice. In this case we represent the posterior p(s,e,l|D) by a set of joint samples from it. The MCMC sampler produces these samples by randomly updating the values of s, e, and l for a number of iterations. This updating algorithm is called Metropolis-Hastings [Gelman et al. 2004].
With a reasonably large number of samples, the MCMC distribution of s, e, and l converges to the stationary distribution.
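The Metropolis-Hastings update rule itself is short. Below is a toy random-walk Metropolis sampler for a one-dimensional target, purely to illustrate the accept/reject step; it is not PyMC's implementation, and the function names and step size are my own choices:

```python
import math
import random

def metropolis(logp, x0, n_iter, step=1.0, seed=42):
    """Toy random-walk Metropolis sampler for a 1-D target.

    logp is an unnormalized log-density; x0 is the starting point.
    """
    rng = random.Random(seed)
    x, chain = x0, []
    for _ in range(n_iter):
        proposal = x + rng.gauss(0.0, step)  # symmetric Gaussian proposal
        # Accept with probability min(1, p(proposal) / p(x)).
        if math.log(rng.random()) < logp(proposal) - logp(x):
            x = proposal
        chain.append(x)
    return chain

# Example: sample a standard normal; the chain mean should be near 0.
chain = metropolis(lambda x: -0.5 * x * x, x0=0.0, n_iter=20000)
mean = sum(chain) / len(chain)
```

Because the proposal is symmetric, the Hastings correction cancels and only the ratio of target densities enters the acceptance probability.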

There are many general-purpose MCMC packages. Here I use PyMC - a Python module created by David Huard, Anand Patil, and Chris Fonnesbeck. I prefer a programming interface to stand-alone programs like WinBUGS for its flexibility. PyMC uses high-performance numerical libraries like NumPy and optimized Fortran routines. The following code imports the model, instantiates the MCMC object, and runs the sampling algorithm:
>>> from pymc.examples import DisasterModel
>>> from pymc import MCMC
>>> M = MCMC(DisasterModel)
>>> M.isample(iter=10000, burn=1000, thin=10)
The iter parameter specifies the number of updating iterations for the sampling algorithm. The algorithm assumes that the burn parameter specifies a sufficiently large number of iterations for the distributions of s, e, and l to converge to stationarity; these burn-in samples are discarded. To avoid strong autocorrelation among successive samples in the Markov chain, the chain can be thinned: only every k-th sample is retained, where k is given by the thin parameter. The number of retained samples for each variable is therefore (iter-burn)/thin.
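The retained-sample count, and the burn/thin operation itself, can be checked with plain Python, independent of PyMC:

```python
def n_retained(n_iter, burn, thin):
    """Number of samples kept after discarding burn-in and thinning."""
    return (n_iter - burn) // thin

# Thinning by slicing: drop the first `burn` samples, keep every `thin`-th.
chain = list(range(10000))  # stand-in for a raw Markov chain
kept = chain[1000::10]
```

With iter=10000, burn=1000, and thin=10 this gives (10000-1000)/10 = 900 samples, matching the traces discussed below.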
Below are the 900 samples (left) of variable s - the year in which the rate parameter changed. The histogram (right) has a mean at 40 (i.e. the year 1891) and a 95% HPD interval of (35, 44).
Next, the 900 samples (left) of variable e - the early rate parameter prior to s. The histogram (right) has a mean at 3.0 and a 95% HPD interval of (2.5, 3.7).
Finally, the 900 samples (left) of variable l - the late rate parameter after s. The histogram (right) has a mean at 0.9 and a 95% HPD interval of (0.7, 1.2).
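Summaries like these can be computed directly from the samples. The sketch below uses one common empirical definition of the HPD interval - the shortest interval containing the requested probability mass - which is an assumption on my part; PyMC ships its own summary utilities:

```python
import math

def hpd_interval(samples, mass=0.95):
    """Shortest interval containing `mass` of the samples (empirical HPD)."""
    xs = sorted(samples)
    n = len(xs)
    k = int(math.ceil(mass * n))  # number of samples the interval must cover
    # Slide a window of k consecutive order statistics; keep the narrowest.
    width, i = min((xs[i + k - 1] - xs[i], i) for i in range(n - k + 1))
    return xs[i], xs[i + k - 1]

# Illustrative (made-up) rate samples, not the actual trace of e:
samples = [2.4, 2.9, 3.1, 3.0, 2.8, 3.3, 2.7, 3.2, 2.6, 3.0]
mean = sum(samples) / len(samples)
lo, hi = hpd_interval(samples)
```

For unimodal posteriors the shortest-interval definition agrees with the usual notion of highest posterior density.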
To conclude our analysis: the UK mining disaster data suggest that between the years 1886 and 1895 some new technology may have been introduced that lowered the disaster rate from approximately 3.0 to 0.9 disasters per year. Well done, Brits!