Introduction to Bayesian Parameter Estimation and Censored Data (Python Implementation)

In the previous post, Bayesian parameter estimation was introduced in the context of right-censored data, and the probability distribution describing our belief in a given estimate of the mean of the underlying distribution was explicitly derived. In this post that estimator is applied to a trial dataset generated and analyzed in Python using the NumPy, SciPy, and matplotlib libraries. The goal is to answer the following data problem:

The Goal

What is the average weight of males in the US given our only scale is one that maxes out at 190 pounds?

In this example the actual distribution has a mean of 187.5 pounds and a standard deviation of 25 pounds. A dataset of 1000 samples is drawn at random from this underlying distribution, of which only the samples at or below 190 pounds are retained in the final dataset. The Python script begins by importing modules and initializing variables.

The Python Implementation

#!/usr/bin/env python
import math

import matplotlib.pyplot as plt
import numpy as np
import scipy.optimize as opt
import scipy.special as special

mean = 187.5; variance = 25**2; rightCensor = 190.0; nSamples = 1000
# precalculate some repeatedly used values
sigma = math.sqrt(variance)        # standard deviation
sigPi = sigma*np.sqrt(np.pi*2.0)   # Gaussian normalization, sigma*sqrt(2*pi)
variance2 = 2.0*variance           # 2*sigma^2

With these variables defined, the functions to be used can now be written. The following likelihood function will be implemented in a Python function named likelihood:

L = \log{p(\mu|X,\sigma)} \propto (-1)\cdot\left( N\log{\Phi\left(\frac{\lambda-\mu}{\sigma}\right)} + \sum_{i=1}^N \left[\frac{(x_i-\mu)^2}{2\sigma^2} + \log{\left(\sigma\sqrt{2\pi}\right)}\right] \right)

The logNormNeg function calculates the per-sample terms \frac{(x_i-\mu)^2}{2\sigma^2} + \log{(\sigma\sqrt{2\pi})} in the likelihood function. The cdfNorm function calculates the standard normal CDF \Phi(x). The percErr function calculates the percentage error of a predicted mean.

def cdfNorm(X):
    # standard normal CDF, Phi(x), via the error function
    return 0.5*(1.0 + special.erf(np.true_divide(X, np.sqrt(2.0))))

def logNormNeg(mu, sigma, X):
    # per-sample negative log of the uncensored Gaussian density
    p1 = np.power(np.subtract(X, mu), 2.0)
    p1 = np.true_divide(p1, variance2)
    return np.add(p1, np.log(sigPi))

def likelihood(theta, X):
    # negative log-likelihood of the candidate mean given the censored data
    muLikely = theta[0]
    # the likelihood correction due to censoring: N*log(Phi((lambda-mu)/sigma))
    censorFactor = float(len(X))*np.log(cdfNorm((rightCensor - muLikely)/sigma))
    # the likelihood in the case of no censoring
    datFactor = np.sum(logNormNeg(muLikely, sigma, X))
    return datFactor + censorFactor

def percErr(actual, obs):
    # absolute percentage error
    return np.abs(actual - obs)/np.abs(actual)
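
As a quick sanity check (a sketch, not part of the original script), cdfNorm can be compared against scipy.stats.norm.cdf, which computes the same standard normal CDF:

```python
import numpy as np
import scipy.special as special
import scipy.stats as stats

def cdfNorm(X):
    # standard normal CDF via the error function, as in the post
    return 0.5*(1.0 + special.erf(np.true_divide(X, np.sqrt(2.0))))

# compare against the reference implementation in scipy.stats
xs = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(np.allclose(cdfNorm(xs), stats.norm.cdf(xs)))  # True
```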

At this point the data can be generated using NumPy's random.normal function and then right censored to obtain the final dataset.

# sample from a gaussian / normal distribution
X = np.random.normal(mean, sigma, nSamples)
# right censor the data
X = X[X <= rightCensor]
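
Because the censor point sits only 2.5 pounds (0.1 standard deviations) above the true mean, roughly \Phi(0.1) \approx 54\% of the samples survive censoring. The following sketch (not part of the original script; it uses a fixed seed and a larger sample than the post's 1000 for a tighter empirical check) confirms the retained fraction:

```python
import numpy as np
import scipy.special as special

mean, sigma, rightCensor = 187.5, 25.0, 190.0
nSamples = 100000  # larger than the post's 1000, for a tighter empirical check

rng = np.random.default_rng(0)
X = rng.normal(mean, sigma, nSamples)
X = X[X <= rightCensor]

# theoretical retention probability: Phi((lambda - mu)/sigma) = Phi(0.1)
phi = 0.5*(1.0 + special.erf((rightCensor - mean)/(sigma*np.sqrt(2.0))))
print(round(phi, 3))    # 0.54
print(len(X)/nSamples)  # empirical retained fraction, close to 0.54
```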

To visualize the dataset, a histogram is produced with matplotlib's hist function.

# linear space for plots
edgeSpace = 0.5*(np.max(X) - np.min(X))
u = np.linspace(np.min(X) - edgeSpace, np.max(X) + edgeSpace, 100)
# histogram the censored data, normalized to a density
count, bins, ignored = plt.hist(X, 30, density=True)

Because the logarithm is monotonic, maximizing the likelihood is equivalent to minimizing the negative log-likelihood. The minimize function from scipy.optimize performs this minimization, which yields the maximum likelihood estimate of the mean of the right-censored distribution.
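
To illustrate what the optimizer is doing, the negative log-likelihood can also be evaluated on a grid of candidate means. The sketch below (not part of the original script; it draws its own synthetic data with a fixed seed, and negLogLikelihood is a hypothetical self-contained version of the post's likelihood function) shows that the minimum lands near the true mean, while the naive sample mean is pulled well below it:

```python
import numpy as np
import scipy.special as special

# same parameters as the post; fresh synthetic data with a fixed seed
mean, sigma, rightCensor = 187.5, 25.0, 190.0
variance2 = 2.0*sigma**2
sigPi = sigma*np.sqrt(2.0*np.pi)

rng = np.random.default_rng(1)
X = rng.normal(mean, sigma, 1000)
X = X[X <= rightCensor]

def negLogLikelihood(mu, X):
    # censoring correction: N * log Phi((lambda - mu)/sigma)
    censorFactor = len(X)*np.log(0.5*(1.0 + special.erf((rightCensor - mu)/(sigma*np.sqrt(2.0)))))
    # ordinary Gaussian negative log-likelihood terms
    datFactor = np.sum((X - mu)**2/variance2 + np.log(sigPi))
    return datFactor + censorFactor

# brute-force search over candidate means
grid = np.linspace(160.0, 200.0, 401)
muHat = grid[np.argmin([negLogLikelihood(mu, X) for mu in grid])]
print("grid-search MLE:", muHat)         # lands near the true mean of 187.5
print("naive sample mean:", np.mean(X))  # biased well below 187.5
```

A grid search is far less efficient than scipy.optimize.minimize, but it makes the shape of the objective easy to inspect.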

# minimize the negative log-likelihood, starting from a guess at the right censor edge
initial_theta = np.array([rightCensor])
minOut = opt.minimize(likelihood, initial_theta, args=(X,))
meanMeasured = minOut['x'][0]
# report and plot the likelihood estimate
print("Mean Estimate For Bayesian Likelihood Maximization :", meanMeasured, " with error % :", 100.0*percErr(mean, meanMeasured), "%")
plt.plot(u, np.exp(-np.power(u - meanMeasured, 2.0)/variance2)/sigPi, label="Bayesian Likelihood")

Next, the distribution is plotted using the dataset mean as the predicted mean, alongside the actual distribution. Lastly, a call to the show function displays the resulting plots, as matplotlib requires.

# plot the sample mean estimate
meanIn = np.mean(X)
print("Mean Estimate For Simple Sample Mean :", meanIn, " with error % :", 100.0*percErr(mean, meanIn), "%")
plt.plot(u, np.exp(-np.power(u - meanIn, 2.0)/variance2)/sigPi, label="Sample Mean")
# plot the actual distribution
print("Actual Mean :", mean)
plt.plot(u, np.exp(-np.power(u - mean, 2.0)/variance2)/sigPi, label="Actual")
# label and show
plt.xlabel("x"); plt.ylabel("Probability")
plt.title("Bayesian Parameter Estimation For Right Censored Data")
plt.legend()
plt.show()

The Result

The following plot is the result of running this script. The histogram bars rise above the underlying density because they are normalized over only the retained samples, which are already cut off by the right censoring. From the plot one observes that the estimate obtained from the Bayesian likelihood method is much better than the naive estimate obtained from the mean of the dataset.


From the preceding analysis one sees that the mathematical work of the previous post pays off in a much better estimate of the properties of the underlying distribution.
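
The size of the naive estimate's bias can even be predicted analytically: for a normal distribution truncated above at \lambda, the expected sample mean is \mu - \sigma\,\varphi(\alpha)/\Phi(\alpha) with \alpha = (\lambda-\mu)/\sigma, where \varphi/\Phi is the inverse Mills ratio. The following sketch (not part of the original script; it uses its own fixed seed and sample size) checks this identity under the post's parameters:

```python
import numpy as np
import scipy.stats as stats

mean, sigma, rightCensor = 187.5, 25.0, 190.0
alpha = (rightCensor - mean)/sigma  # 0.1 standard deviations above the mean

# expected mean of the censored (truncated-above) sample: mu - sigma*phi(alpha)/Phi(alpha)
truncatedMean = mean - sigma*stats.norm.pdf(alpha)/stats.norm.cdf(alpha)
print(round(truncatedMean, 1))  # 169.1 -- about 18 pounds below the true mean

# empirical check with a large sample
rng = np.random.default_rng(2)
X = rng.normal(mean, sigma, 200000)
X = X[X <= rightCensor]
print(abs(np.mean(X) - truncatedMean) < 0.5)  # True
```

This predicted 18-pound shortfall is exactly the gap the censoring-corrected likelihood removes.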


