In the previous post, Bayesian parameter estimation was introduced in the context of right-censored data, and the probability distribution expressing the belief in a given estimate of the mean of the underlying distribution was explicitly derived. In this post that estimate will be implemented on a trial dataset generated and analyzed in Python using the numpy and matplotlib libraries. The goal is to answer the following data problem:
What is the average weight of males in the US given our only scale is one that maxes out at 190 pounds?
In this example the actual distribution has a mean of 187.5 pounds and a standard deviation of 25 pounds. A data set of 1000 samples will be drawn from the underlying distribution, of which only the samples below 190 pounds are retained in the final data set. The Python script begins by importing modules and initializing variables.
The Python Implementation
```python
#!/usr/bin/env python
import math

import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
import numpy as np
import scipy.optimize as opt
import scipy.special as special

mean = 187.5
variance = 25**2
rightCensor = 190.0
nSamples = 1000

# precalculate some values
sigma = math.sqrt(variance)
sigPi = sigma*np.sqrt(np.pi*2.0)
variance2 = 2.0*variance
```
With these variables defined, the functions to be used can now be written. The following negative log-likelihood function will be implemented in a Python function named likelihood:

$$-\ln L(\mu) = \sum_{i=1}^{n}\left[\frac{(x_i-\mu)^2}{2\sigma^2} + \ln\left(\sigma\sqrt{2\pi}\right)\right] + n\ln\Phi\left(\frac{c-\mu}{\sigma}\right)$$

where $n$ is the number of retained samples, $c$ is the right-censor point, and $\Phi$ is the standard normal CDF. The logNormNeg function will calculate the $\frac{(x_i-\mu)^2}{2\sigma^2} + \ln(\sigma\sqrt{2\pi})$ term in the likelihood function. The cdfNorm function will calculate the $\Phi$ term. The percErr function will calculate the percentage error in our predicted mean values.
```python
def cdfNorm(X):
    # standard normal CDF, Phi(X)
    return 0.5*(1.0 + special.erf(np.true_divide(X, np.sqrt(2.0))))

def logNormNeg(mu, sigma, X):
    # negative log of the normal pdf evaluated at each sample
    p1 = np.power(np.subtract(X, mu), 2.0)
    p1 = np.true_divide(p1, variance2)
    return np.add(p1, np.log(sigPi))

def likelihood(theta, X):
    # negative log-likelihood of the data given the predicted mean
    muLikely = theta[0]
    # the likelihood correction due to censoring
    censorFactor = float(len(X))*np.log(cdfNorm((rightCensor - muLikely)/sigma))
    # the likelihood in the case of no censoring
    datFactor = np.sum(logNormNeg(muLikely, sigma, X))
    return datFactor + censorFactor

def percErr(actual, obs):
    return np.abs(actual - obs)/np.abs(actual)
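One caveat worth noting: likelihood takes np.log of cdfNorm, and $\Phi$ underflows to zero in double precision when the censor point sits far into the lower tail of the candidate distribution, turning the log into -inf. That is not a problem for the parameters of this example, but if it were, scipy.special.log_ndtr computes $\ln\Phi$ directly and stays finite. A minimal sketch of the difference:

```python
import numpy as np
import scipy.special as special

z = -40.0  # a censor point 40 standard deviations below the candidate mean

# naive route: Phi(z) underflows to 0.0, so its log is -inf
naive = np.log(0.5*(1.0 + special.erf(z/np.sqrt(2.0))))

# stable route: log Phi(z) computed directly, still finite
stable = special.log_ndtr(z)

print(naive)   # -inf
print(stable)  # roughly -804
```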
At this point the data can be generated using numpy's random.normal function and right-censored to obtain the final data set.
```python
# sample from gaussian / normal
X = np.random.normal(mean, sigma, nSamples)

# right censor the data
X = X[X <= rightCensor]
```
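As a quick sanity check, the fraction of samples that survives the cut should be close to $\Phi((190 - 187.5)/25) = \Phi(0.1) \approx 0.54$, so roughly 540 of the 1000 samples end up in the final data set. A self-contained sketch (the variable names simply mirror the script above):

```python
import numpy as np
import scipy.special as special

mean, sigma, rightCensor, nSamples = 187.5, 25.0, 190.0, 1000

X = np.random.normal(mean, sigma, nSamples)
X = X[X <= rightCensor]

# expected retention fraction: Phi((c - mu) / sigma)
expected = 0.5*(1.0 + special.erf((rightCensor - mean)/(sigma*np.sqrt(2.0))))

print(len(X), expected)  # roughly 540 retained; expected is about 0.54
```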
To visualize the data set, a histogram will be created using matplotlib's hist function.
```python
# linear space for plots
edgeSpace = 0.5*(np.max(X) - np.min(X))
u = np.linspace(np.min(X) - edgeSpace, np.max(X) + edgeSpace, 100)

# histogram the censored data (normed=True was renamed density=True in newer matplotlib)
count, bins, ignored = plt.hist(X, 30, density=True)
```
Because the log function is monotonic, maximizing the likelihood function is equivalent to minimizing the negative log-likelihood. The minimize function from scipy.optimize will be used to perform this minimization, which yields the maximum likelihood estimate of the mean of the right-censored distribution.
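The log is not just a notational convenience: multiplying a thousand probability densities together underflows double precision to exactly zero, while summing their logs stays a perfectly ordinary finite number. A small demonstration with the parameters of this example (norm here is scipy.stats.norm):

```python
import numpy as np
from scipy.stats import norm

X = np.random.normal(187.5, 25.0, 1000)

# the direct product of 1000 densities underflows to exactly 0.0
directProduct = np.prod(norm.pdf(X, loc=187.5, scale=25.0))

# the sum of log densities is finite and easy to optimize
logSum = np.sum(norm.logpdf(X, loc=187.5, scale=25.0))

print(directProduct)  # 0.0
print(logSum)
```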
```python
# minimize the likelihood starting from a guess at the right censor edge
initial_theta = np.array([rightCensor])
minOut = opt.minimize(likelihood, initial_theta, args=(X,))
meanMeasured = minOut['x']

# plot likelihood estimate; the normal pdf is written out explicitly
# because mlab.normpdf was removed in newer matplotlib
print("Mean Estimate For Bayesian Likelihood Maximization :", meanMeasured,
      " with error % :", (100.0*percErr(mean, meanMeasured)), "%")
plt.plot(u, np.exp(-np.power(u - meanMeasured, 2.0)/variance2)/sigPi,
         label="Bayesian Likelihood")
```
For comparison, the distribution will also be plotted using the naive estimate, the sample mean of the censored data, alongside the actual distribution. Lastly, a call to the show function will be made to display the resulting plots, as required by matplotlib.
```python
# plot sample mean estimate
meanIn = np.mean(X)
print("Mean Estimate For Simple Sample Mean :", meanIn,
      " with error % :", (100.0*percErr(mean, meanIn)), "%")
plt.plot(u, np.exp(-np.power(u - meanIn, 2.0)/variance2)/sigPi, label="Sample Mean")

# plot actual
print("Actual Mean :", mean)
plt.plot(u, np.exp(-np.power(u - mean, 2.0)/variance2)/sigPi, '--', label="Actual")

# Label and show
plt.xlabel("x")
plt.ylabel("Probability")
plt.title("Bayesian Parameter Estimation For Right Censored Data")
plt.legend(loc=2)
plt.show()
```
The following plot is the result of running this script. The histogram bars sit above the underlying density because the histogram is normalized to unit area over only the retained samples, which the right censoring has already cut down. From the plot one observes that the estimate obtained from the Bayesian likelihood method is much better than the naive estimate obtained from the sample mean of the censored data set.
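The gap between the two estimates can also be quantified directly. The expected value of a normal distribution truncated above at $c$ is $\mu - \sigma\,\phi(\alpha)/\Phi(\alpha)$ with $\alpha = (c-\mu)/\sigma$, and this is what the naive sample mean converges to. For the parameters of this example it works out to roughly 169 pounds, about 10% below the true mean of 187.5. A self-contained check using scipy.stats.norm for $\phi$ and $\Phi$:

```python
from scipy.stats import norm

mu, sigma, c = 187.5, 25.0, 190.0
alpha = (c - mu)/sigma  # 0.1

# expected value of the right-truncated normal:
# this is the value the naive sample mean converges to
truncatedMean = mu - sigma*norm.pdf(alpha)/norm.cdf(alpha)

print(truncatedMean)  # roughly 169.1 pounds
```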
From the preceding analysis, one observes that the mathematical work of the previous post pays off in a substantially better estimate of the properties of the underlying distribution.