This app shows realtime bitcoin price information in the menu bar at the top of your OS X screen. Keep track of bitcoin on the exchange of your choice at any time of day and have peace of mind. You can choose the exchange (Coinbase / BTC-E), the frequency of updates (up to a limit), and even remove the icon if you want.

This app shows realtime stock information in the menu bar at the top of your OS X screen. Keep track of a stock of your choice at any time of day and have peace of mind. You can choose the stock, the frequency of updates (up to a limit), and even remove the icon if you want.


This app takes a directory with PDFs and outputs a corresponding directory that contains the same PDFs labeled with groupings based on the text of the PDFs. It also supports renaming of PDFs based on some of the initial text so that your research papers don't have to be named by number.

This is useful for research papers from arXiv, whose downloaded PDFs tend to come with numeric file names. Using the renaming option of this Mac app, you end up with a useful, human-readable name for each file instead.

It lets you go from the following files

And get the following:


**The Goal**

*What is the average weight of males in the US given our only scale is one that maxes out at 190 pounds?*

In this example the actual distribution will have a mean of 187.5 pounds and a standard deviation of 25 pounds. A data set of 1000 samples will be drawn from the underlying distribution, of which only the samples below 190 pounds are retained in the final data set. The Python script begins by initializing variables and importing modules.

**The Python Implementation**

```
#!/usr/bin/env python
import math

import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
import numpy as np
import scipy.optimize as opt
import scipy.special as special

# distribution and sampling parameters
mean = 187.5
variance = 25**2
rightCensor = 190.0
nSamples = 1000

# precalculate some values
sigma = math.sqrt(variance)
sigPi = sigma * np.sqrt(np.pi * 2.0)
variance2 = 2.0 * variance
```

With these variables defined, the functions to be used can now be written. The following negative log likelihood will be implemented in a Python function named *likelihood*:

L(mu) = sum over i of [ (x_i - mu)^2 / (2 sigma^2) + log( sigma sqrt(2 pi) ) ] + N log Phi( (rightCensor - mu) / sigma )

The *logNormNeg* function calculates the per-sample term (x_i - mu)^2 / (2 sigma^2) + log( sigma sqrt(2 pi) ). The *cdfNorm* function calculates the standard normal cumulative distribution term Phi. The *percErr* function calculates the percentage error in our predicted mean values.

```
def cdfNorm(X):
    # standard normal cumulative distribution function
    return 0.5 * (1.0 + special.erf(np.true_divide(X, np.sqrt(2.0))))

def logNormNeg(mu, sigma, X):
    # negative log of the normal density for each sample
    p1 = np.power(np.subtract(X, mu), 2.0)
    p1 = np.true_divide(p1, variance2)
    return np.add(p1, np.log(sigPi))

def likelihood(theta, X):
    # negative log likelihood of the right-censored data set
    muLikely = theta[0]
    # the likelihood correction due to censoring
    censorFactor = float(len(X)) * np.log(cdfNorm((rightCensor - muLikely) / sigma))
    # the likelihood in the case of no censoring
    datFactor = np.sum(logNormNeg(muLikely, sigma, X))
    return datFactor + censorFactor

def percErr(actual, obs):
    return np.abs(actual - obs) / np.abs(actual)
```

At this point the data can be generated using numpy's *random.normal* function and then right censored to obtain the final data set.

```
# sample from gaussian / normal
X = np.random.normal(mean, sigma, nSamples)
# right censor the data
X = X[X <= rightCensor]
```

To visualize the data set, a histogram will be created using matplotlib's *hist* function.

```
# linear space for plots
edgeSpace = 0.5 * (np.max(X) - np.min(X))
u = np.linspace(np.min(X) - edgeSpace, np.max(X) + edgeSpace, 100)
# histogram the censored data
count, bins, ignored = plt.hist(X, 30, normed=True)
```

Because the log function is monotonic, maximizing the likelihood is equivalent to minimizing the negative log likelihood. The *minimize* function in the scipy module will be used to perform this minimization, which yields the maximum likelihood prediction of the mean of the right-censored distribution.

```
# minimize the likelihood starting from a guess at the right censor edge
initial_theta = np.array([rightCensor])
minOut = opt.minimize(likelihood, initial_theta, args=(X,))
meanMeasured = minOut['x'][0]
# plot likelihood estimate
print "Mean Estimate For Bayesian Likelihood Maximization :", meanMeasured, " with error % :", (100.0 * percErr(mean, meanMeasured)), "%"
plt.plot(u, mlab.normpdf(u, meanMeasured, sigma), label="Bayesian Likelihood")
```

Now the distribution will be plotted using the data-set mean as the predicted mean, alongside the actual distribution. Lastly, a call to Matplotlib's *show* function displays the resulting plots.

```
# plot sample mean estimate
meanIn = np.mean(X)
print "Mean Estimate For Simple Sample Mean :", meanIn, " with error % :", (100.0 * percErr(mean, meanIn)), "%"
plt.plot(u, mlab.normpdf(u, meanIn, sigma), label="Sample Mean")
# plot actual
print "Actual Mean :", mean
plt.plot(u, mlab.normpdf(u, mean, sigma), '--', label="Actual")
# label and show
plt.xlabel("x")
plt.ylabel("Probability")
plt.title("Bayesian Parameter Estimation For Right Censored Data")
plt.legend(loc=2)
plt.show()
```

**The Result**

The following plot is the result of running this script. The bars in the histogram rise above the underlying distribution's curve because they are normalized over data that has already been cut off by the *right censoring*. From the plot, one observes that the estimate obtained from the Bayesian likelihood method is much better than the naive estimate obtained from the mean of the data set.

From the preceding analysis, one observes that the mathematical work from the previous post pays off in a better estimate of the properties of the underlying distribution.

-j

Bayes' theorem provides a framework for parameter estimation that allows parameters to be estimated in a flexible way starting from simple principles. This is appealing for people who prefer deriving an estimate from first principles to memorizing formulas.

The model building starts from the following formula, called Bayes' theorem, much as classical mechanics in physics starts from Newton's laws:

P(M | D) = P(D | M) P(M) / P(D)

In this formula the posterior P(M | D) is the probability of our model M given the data D. The model probability is proportional to the prior belief in the model, P(M), and the likelihood of the data given our model, P(D | M). From this equation we can derive ordinary least squares regression, familiar to anyone who has analyzed a simple data set, given the following three assumptions: the errors in our model, otherwise known as the residuals, follow a Normal (Gaussian) distribution; the prior belief over models is uniform; and the elements of our data are statistically independent. Typically it is also assumed that the variance in the data set is fixed across measurements, but in this Bayesian framework that assumption can be relaxed.
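The OLS connection can be checked numerically. In this minimal sketch (the data and variable names are illustrative, not from the post), a line is fit two ways: by the least-squares normal equations, and by minimizing a Gaussian negative log likelihood with fixed variance. Under the three assumptions above, both routes land on the same coefficients.

```
import numpy as np
from scipy import optimize

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, x.size)

# ordinary least squares via the normal equations
A = np.column_stack([x, np.ones_like(x)])
ols_theta, *_ = np.linalg.lstsq(A, y, rcond=None)

# Gaussian negative log likelihood with fixed variance: the variance
# term is a constant, so minimizing this is exactly minimizing the
# sum of squared residuals
def neg_log_like(theta):
    residuals = y - (theta[0] * x + theta[1])
    return 0.5 * np.sum(residuals**2)

ml_theta = optimize.minimize(neg_log_like, x0=[0.0, 0.0]).x

print(ols_theta, ml_theta)
```

The two parameter vectors agree to numerical precision, which is the content of the OLS-from-Bayes derivation.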

**What does one mean by censored data?**

In this post, the details of applying this Bayesian framework will be explored for the problem of estimating the mean of a distribution given that we can only get samples below a given threshold. In an upcoming post, the theoretical results will be applied to the following mean-weight estimation problem: what is the average weight of males in the US given our only scale is one that maxes out at 190 pounds? To answer this, one can model the data using Bayesian logic, assuming that male weight is distributed like a bell curve. Not being able to measure values above a certain threshold is known as *right censoring*.

Assuming that the mean US male weight is 187.5 pounds with a standard deviation of 25 pounds, the goal will be to estimate the mean male weight from 1000 measurements on our 190-pound-maximum scale. To apply Bayes' theorem, the likelihood of a measurement must be determined; for a retained sample it takes the form

P(x_i | mu) = f(x_i; mu, sigma) / Phi( (c - mu) / sigma )

Here c is the scale maximum, which will be 190 pounds; f is a Normal (Gaussian) density; Phi is the standard normal cumulative distribution function; and x_i is the i-th measurement obtained from our scale. In the case where our scale effectively has no maximum, Phi approaches one and this function is simply a Normal (Gaussian) function.

**A simple discrete example for calculating conditional probabilities**

To obtain the form of the likelihood, we will start with a simple discrete example: rolling a loaded die whose probabilities for the faces 1 through 6 are p_1 through p_6.

We will calculate the probability of rolling a 1 given that the roll is below 4, mathematically notated P(roll = 1 | roll < 4). The total probability over all possible rolls is equal to 1:

p_1 + p_2 + p_3 + p_4 + p_5 + p_6 = 1

Similarly, the conditional probabilities must sum to 1 across all the cases that satisfy the constraint roll < 4. These constraint-satisfying rolls are the values 1, 2, 3:

P(1 | roll < 4) + P(2 | roll < 4) + P(3 | roll < 4) = 1

The probability P(roll < 4) can be easily calculated as the sum of the unconditional probabilities for 1, 2, 3: P(roll < 4) = p_1 + p_2 + p_3. Intuitively, the unconditional probability p_1 is proportional to the conditional probability of interest. This proportionality, along with the constraint that the conditional probabilities are normalized, leads to the following conditional probability:

P(roll = 1 | roll < 4) = p_1 / ( p_1 + p_2 + p_3 )
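This calculation can be sketched numerically. The die probabilities below are made up for illustration (the original probability table was an image and did not survive):

```
# hypothetical loaded-die probabilities for faces 1..6
# (illustrative values only; the post's original table was lost)
p = {1: 0.25, 2: 0.20, 3: 0.15, 4: 0.15, 5: 0.15, 6: 0.10}

# the total probability over all faces must equal 1
assert abs(sum(p.values()) - 1.0) < 1e-9

# P(roll < 4) is the sum of the unconditional probabilities of 1, 2, 3
p_below4 = p[1] + p[2] + p[3]

# conditioning renormalizes: P(roll = 1 | roll < 4) = p_1 / (p_1 + p_2 + p_3)
p_1_given_below4 = p[1] / p_below4
print(p_below4, p_1_given_below4)
```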

**Applying Bayes theorem to parameter estimation : What is the mean of the underlying distribution?**

Applying the form obtained in the discrete example to the continuous case leads to the following for a limiting value c, which in our example of the weight-limited scale is 190 pounds:

P(x | x < c) = p(x) / P(x < c)

Now we apply this formula to our weight example, where the density for the i-th measurement x_i is given by a Normal (Gaussian) function f(x_i; mu, sigma).

Let us define the cumulative distribution function for a standard normal distribution (mu = 0, sigma = 1) as Phi. Then P(x < c) = Phi( (c - mu) / sigma ).

If we apply Bayes' theorem with a uniform prior distribution, we obtain the following posterior for a complete data set of N measurements x_1, ..., x_N:

P(mu | X) is proportional to the product over i of f(x_i; mu, sigma) / Phi( (c - mu) / sigma )

We will take the log of this posterior and find the value of the mean which has the maximum probability given the data by maximizing this log likelihood:

log P(mu | X) = sum over i of log f(x_i; mu, sigma) - N log Phi( (c - mu) / sigma ) + constant

The most likely value of the mean is obtained by setting the derivative of this log likelihood with respect to mu equal to zero:

sum over i of (x_i - mu) / sigma^2 + N phi( (c - mu) / sigma ) / ( sigma Phi( (c - mu) / sigma ) ) = 0

where phi is the standard normal density. The positive correction term pushes the estimate of mu above the sample mean, compensating for the samples lost to censoring.

We now have the mathematical equation for the likelihood, along with the condition the most probable (and therefore most likely) estimate of the mean must satisfy. In an upcoming post, this theoretical result will be implemented in Python to estimate the mean male weight from 1000 measurements on our 190-pound-maximum scale, using data simulated under the assumption that the mean US male weight is 187.5 pounds with a standard deviation of 25 pounds.

--j

The goal will be to perform classification of online job listings such as the following at http://www.simplyhired.com/job/survivability-airframe-integration-engineer-job/northrop-grumman/gryhetooim .

We want to design a system that will take this input text data and classify it as text for a defense contractor engineering job listing.

The survivability integration engineer will utilize 1D, 2D, and 3D computational electromagnetic codes such as method of moments (MoM) and physical optics (PO) to perform component and vehicle level design and analysis and ensure system survivability. Engineer will perform survivability testing and data post-processing using Knowbell or Pioneer to verify survivability design of antennas, propulsion systems, and vehicle features such as edges, wing folds, gaps and door seals. Engineer will perform testing at both indoor and outdoor testing facilities. As an Engineer 3, the engineer will perform his/her tasks under the guidance of a senior engineer with general instruction. The survivability integration engineer will be a part of an integrated product team (IPT), supporting multi-discipline advanced development programs and technology research and development (R&D). Qualifications Basic Qualifications: A minimum of a bachelor's degree in electrical, mechanical, or aerospace engineering, or physics, and five years of relevant work experience is required for this position and a minimum of 2 years of applicable experience in Applied Electromagnetics or RF survivability design. Candidates with a masters degree in any of the areas listed above are required to have at least three years of experience. Candidates with a PhD in the areas listed above having no industry experience may apply for this position if their research or post-doctoral work is applicable. The ability to obtain and maintain a Department of Defense top secret clearance and special access program clearances is required for this position. Preferred Qualifications: Preference will be given to candidates with degrees in electrical engineering. Coursework in applied electro-physics, finite element analysis, or aircraft design is desirable. Candidates with experience in signature integration, RF survivability design, analysis, and testing are highly desired. 
Experience with Method of Moments codes such as HFSS or COMSOL, and shooting and bouncing ray codes such as Xpatch are highly desired. Experience with RCS data processing tools, such as Knowbell or Pioneer are highly desired. Computer-aided drafting (CAD) skills with, such as Catia or UG are a plus. Preference will be given to candidates with excellent verbal and written communication skills. An active secret or top clearance is highly preferred.

Let's now begin with a quick introduction to support vector machines.

**A minimal introduction to support vector machines**

Support vector machines perform classification by drawing a decision boundary, such as a line in two dimensions or a hyperplane in higher dimensions. This decision boundary determines the predicted class of the input data. In the preceding figure, the prediction determines whether the input data should be classified as a circle or a square. The key question becomes how to determine where to draw this line. In support vector machines, the idea is to draw the line such that the white empty space is as large as possible while dividing the data. This white empty space is referred to as the "margin", and the goal is to maximize it.

We can use linear algebra to determine the width of the margin as 2/||w||, where w is a vector orthogonal (perpendicular) to the decision boundary. The decision itself is given by the algebraic sign of the function

f(x) = w * x - b

The problem now becomes an optimization problem: maximize 2/||w|| subject to every training point landing on the correct side of the margin. Optimization has a whole industry devoted to it, and for this reason the details of performing this optimization are skipped here.
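These quantities can be inspected directly with scikit-learn, whose linear SVC exposes w and b after fitting, from which the margin 2/||w|| follows. The toy data below is illustrative:

```
import numpy as np
from sklearn import svm

# two linearly separable clusters (toy data, for illustration)
X = np.array([[0.0, 0.0], [0.5, 0.5], [3.0, 3.0], [3.5, 3.0]])
y = np.array([0, 0, 1, 1])

clf = svm.SVC(kernel="linear", C=1e6)  # large C approximates a hard margin
clf.fit(X, y)

w = clf.coef_[0]
b = -clf.intercept_[0]  # so the decision function reads f(x) = w.x - b
margin = 2.0 / np.linalg.norm(w)

# the algebraic sign of f(x) = w.x - b gives the predicted class
f = X @ w - b
print(margin, np.sign(f))
```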

**Obtaining a training set**

To build a model, input data is required, but sadly this data is not readily available in an open data set; therefore, it has to be scraped from online sources and preprocessed using some basic natural language processing (NLP) techniques, such as removing stopwords (http://en.wikipedia.org/wiki/Stop_words). To perform this task, a web scraper was written in Python using the urllib, beautiful-soup4, and nltk libraries. The data was scraped from the results of web searches on the job-listings site SimplyHired. The search terms were the occupations, which become the target class labels.
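The stopword-removal step can be sketched as follows. The real scraper used nltk's English stopword corpus; the abbreviated list here stands in for it so the sketch is self-contained:

```
import re

# a small stand-in for nltk's English stopword list
# (the real scraper used nltk.corpus.stopwords)
STOPWORDS = {"a", "an", "the", "is", "are", "for", "and", "of", "to", "in", "with"}

def strip_stopwords(text):
    # lowercase, keep word tokens, and drop stopwords
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

print(strip_stopwords("We are seeking an engineer with experience in RF design"))
```

Applied to a full listing, this produces the kind of condensed single-line text shown below.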

The result was a single line of text like the following, from an accounting-specialist search on SimplyHired job listings.

We currently great opportunity finance industry Experis seeking Accounting responsible wide range input processing core accounting system Great Essential Duties Responsibilities Assist processing lease bookings lease end lease processing cash receipts lease discounting etc Process periodic billing daily cash receipts Respond wide range requests documentation group sales staff customers Required Knowledge Skills Abilities Associate degree Accounting plus years experience solid general accounting knowledge leasing industry experience Experience Crystal Reports plus Proficient Excel Word Ability learn specific software applications Attention detail strong organizational skills Ability handle wide variety tasks Ability interface customers project professional image times Flexibility handle uneven nature work load deadlines Experis Who We Are As leader project solutions professional talent resourcing contract permanent positions Experis matches professionals rewarding Finance IT Engineering opportunities industry leading organizations helping accelerate careers delivering excellent results client companies Experis part world leader innovative workforce solutions Experis Benefits Expand connections Grow experience skills Accelerate career And even work us Experis professionals earn comprehensive benefits industry Along competitive pay benefits may include dental vision disability life insurance flexible spending accounts employee stock purchase plan holiday pay thousands online training courses develop current skills explore something new interviewing consulting advice job search tips helpful tools Learn Experis jobs Experis Equal Opportunity Employer

**Training the support vector machine model**

The first step in modeling the text is to *vectorize* it, which maps the text data into a multidimensional space where hyperplanes can decide on a prediction for the occupation label. A common approach, used here, is tf-idf vectorization (http://en.wikipedia.org/wiki/Tf%E2%80%93idf). Due to the nature of text, these vectors are sparse, which rules out classification techniques better suited to dense vector data, like random forest classification. The occupation labels are mapped to integers using a dictionary from integer keys to the actual occupation text.
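A minimal tf-idf vectorization sketch with scikit-learn; the two documents are illustrative snippets, not the scraped data:

```
from sklearn.feature_extraction.text import TfidfVectorizer

# two illustrative cleaned job-listing snippets
docs = [
    "accounting lease billing cash receipts excel",
    "electromagnetics antenna rf survivability testing",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse tf-idf matrix

# one row per document, one column per vocabulary term
print(X.shape)
```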

The input vectors can then be modeled using support vector machines as desired. This was implemented using Python's scikit-learn library. The result is a model which takes an input job listing and classifies it into an occupation category. However, feeding in data requires knowledge of Python I/O, so a simple website is needed to allow *demoing* the model by entering the URL of a webpage to classify.
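The train-and-predict flow can be sketched as a scikit-learn pipeline. The snippets, labels, and label dictionary below are illustrative stand-ins for the scraped training set:

```
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# toy training set: a few cleaned job-listing snippets per occupation
texts = [
    "accounting lease billing cash receipts general ledger excel",
    "accounts payable reconciliation bookkeeping crystal reports",
    "electromagnetics antenna rf survivability radar testing",
    "method of moments hfss xpatch rcs aircraft design",
]
labels = [0, 0, 1, 1]
# dictionary mapping integer keys to the occupation text
label_names = {0: "accounting specialist", 1: "defense engineer"}

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, labels)

pred = model.predict(["rf antenna testing and rcs analysis"])[0]
print(label_names[pred])
```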

**The Django webapp**

To allow webpage URLs to be easily inputted, a Django webapp will be deployed. This Django app interfaces with the underlying model by fetching the input URL and preprocessing it into a form the underlying SVM model can use to make a prediction. The prediction is then displayed back to the user. The following is a screenshot of the resulting webapp.
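The fetch-and-preprocess step the webapp performs can be sketched as follows. The original app used beautiful-soup4; this minimal version uses a regex tag stripper, and `fetch_listing_text` is a hypothetical helper name, not from the post:

```
import re
import urllib.request

def extract_text(html):
    # drop script/style blocks, then remove remaining tags
    html = re.sub(r"(?s)<(script|style).*?</\1>", " ", html)
    text = re.sub(r"<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", text).strip()

def fetch_listing_text(url):
    # fetch the page and strip its markup down to plain text
    # (the real app used beautiful-soup4 for more robust extraction)
    html = urllib.request.urlopen(url).read().decode("utf-8", errors="ignore")
    return extract_text(html)

# a Django view would call fetch_listing_text(request.GET["url"]),
# run model.predict on the result, and render the label in a template
print(extract_text("<html><body><p>RF survivability engineer</p></body></html>"))
```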

The Django web app can then be deployed using an Amazon AWS EC2 instance to demo the underlying model.

There are weaknesses to this approach that can be discussed at a later time, among them the fact that the hierarchical nature of occupations is in no way included in this model. SVMs do have an advantage in their ease of use and performance, especially for a first solution to this text-mining problem.

-j


The term "defensive bias" will be used to describe this measure of defensive effort. The first step in calculating it is computing the average scoring by the opposing (away) team over its last *N* games.

In this analysis, *N*, the number of games averaged, is set to 10. Once the average over the *N* previous games is calculated, the current game's score is used to estimate the defensive bias of the home team by weighing the opponent's trailing average against the points they actually scored.

In a concrete example, for the game Golden State Warriors at Los Angeles Lakers (December 28, 2008), the Warriors' scoring average over their last 10 games is calculated and weighed against the 113 points they scored on Dec 28.
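One reasonable reconstruction of the metric (the original formula image was lost) is the opponent's trailing N-game scoring average minus the points they actually scored, so positive values mean the defense held the opponent below its average. The scores below are illustrative, not the Warriors' actual 2008 game log:

```
import numpy as np

def defensive_bias(prev_scores, allowed, n=10):
    # assumed form of the metric: how far below the opponent's
    # trailing n-game scoring average the home team held them
    # (one reconstruction; the post's formula image did not survive)
    avg = np.mean(prev_scores[-n:])
    return avg - allowed

# illustrative last-10-games scoring, weighed against 113 points allowed
prev = [104, 110, 99, 121, 108, 116, 102, 95, 111, 107]
print(defensive_bias(prev, 113))
```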

This defensive bias was calculated for multiple teams in the 2008-2009 NBA season and averaged across the season. This analysis is summarized in the following chart.

From this analysis, we observe that some of the reputedly best defensive teams, like the 2008-2009 Boston Celtics, are among the leaders in defensive bias. The best 7 teams are as follows:

Boston Celtics

Los Angeles Lakers

Cleveland Cavaliers

Dallas Mavericks

Portland Trailblazers

San Antonio Spurs

Orlando Magic

The 2010-2011 NBA season was also analyzed to have another season to compare against the instincts of those following professional basketball.

The best 7 defensive bias teams in this season are as follows:

Chicago Bulls

Miami Heat

Los Angeles Lakers

San Antonio Spurs

Dallas Mavericks

Oklahoma City Thunder

Denver Nuggets

From this analysis, the data reflect the success of the San Antonio Spurs' defensive system and Tom Thibodeau's success with the Bulls in his first season.

-j


Access to this data enables a quantitative look at the flow and costs of these excess military items. The data was first processed using Python to calculate the quantity of items obtained by each state through Program 1033, as well as the net cost of these items in millions of dollars. The processed data was plotted using Plotly, and Matplotlib with basemap for the map graphics. The following plot shows the distribution of costs and quantities of military items for each state. The figure image links to the interactive version hosted on Plotly here.
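The per-state aggregation can be sketched with pandas. The rows and column names below are illustrative stand-ins for the Program 1033 records:

```
import pandas as pd

# minimal stand-in for the Program 1033 transfer records
# (illustrative values; the real data lists one transfer per row)
df = pd.DataFrame({
    "state": ["TX", "TX", "CA", "CA", "NM"],
    "item": ["RIFLE,5.56", "TRUCK,UTILITY", "RIFLE,5.56", "GOGGLES", "FLASHLIGHT"],
    "quantity": [120, 3, 80, 500, 40],
    "cost": [60000.0, 245000.0, 40000.0, 25000.0, 2000.0],
})

# total item quantity and net cost (in millions of dollars) per state
per_state = df.groupby("state").agg(
    total_items=("quantity", "sum"),
    total_cost_millions=("cost", lambda c: c.sum() / 1e6),
)
print(per_state)
```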


Alternatively, one can focus on just the costs of the items acquired under Program 1033 for each state, in millions of dollars.

Scaling has been a theme in the last few posts, and the following figure touches on that subject by showing how the quantities of the individual items are distributed.

The items most commonly received are more along the lines of the items you can find at your local army surplus store as can be observed in the following list of the items received in the largest quantities.

Top Items Received

120669 MAGAZINE;CARTRIDGE

52981 SHIRT;COLD WEATHER

35490 FIELD PACK

32860 BANDAGE KIT;ELASTIC

30142 CHEST;AMMUNITION

26729 WIRE;ELECTRICAL

20367 GOGGLES;INDUSTRIAL

16952 SLEEPING BAG

14979 FLASHLIGHT

14734 BAR;METAL

13975 GOGGLES;BALLISTIC

13543 SIGHT;REFLEX

12096 BLANKET;BED

11744 MODULAR SLEEP SYSTE

11570 DRAWERS;MEN'S

11275 RUBBER SHEET;SOLID

11126 BANDAGE KIT

10402 SCREW;TAPPING

10207 STUFF SACK;COMPRESS

10204 HYDRATION SYSTEM

10016 UNDERSHIRT;COLD WEA

9859 DRESSING;COMPRESSION

9720 PAD;KNEE

9545 INTRENCHING TOOL;HA

9376 STRAP;TIEDOWN;ELECTRICAL COMPONENTS

9291 GLOVES;FLYERS'

9065 SAFETY GLASSES;REVI

8815 MODULE;TRAUMA

8782 MAT;SLEEPING

8412 HOOD;COLD WEATHER

8284 SPOON;FIELD MESS

8069 BIVY COVER

7993 POUCH;M4 THREE MAG

7740 FIRST AID KIT;UNIVERSAL

7506 DRAWERS;COLD WEATHE

7487 TUBE;METALLIC


The goal is not to create a conclusive research paper; rather, a data-driven approach will be taken by looking into publicly available data for both height and income inequality. To quantify income inequality, the Gini coefficient will be used. This metric is not without faults, but its popular use makes world data easy to collect. For quantifying height, the average height of a country's population in centimeters (cm) will be used. Wikipedia has data available for both of these metrics at the following pages.

https://en.wikipedia.org/wiki/Human_height#Average_height_around_the_world https://en.wikipedia.org/wiki/List_of_countries_by_distribution_of_wealth

Python has many modules for data fetching, parsing, and analysis, and will be used for this task along with Plotly. The script for this data fetching is available here.
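The country-level join that the fetch script performs can be sketched as follows, using hand-entered values for the three outlier countries (the full script scrapes the complete Wikipedia tables):

```
# hand-entered values from the two Wikipedia pages above
# (the three male-data outliers only, for brevity)
gini = {"Japan": 0.547, "Bolivia": 0.762, "Denmark": 0.808}
height_cm = {"Japan": 170.7, "Bolivia": 160.0, "Denmark": 182.6}

# join the two tables on country, keeping only countries present in both
merged = {c: (gini[c], height_cm[c]) for c in gini if c in height_cm}
for country, (g, h) in sorted(merged.items()):
    print(country, g, h)
```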

Let's look at the data for males first. The outliers in the data set are the following.

Country : Gini Coefficient : Avg Height (cm)

Japan : 0.547 : 170.7

Bolivia : 0.762 : 160

Denmark : 0.808 : 182.6

In the data, one observes the suspected relationship between having a taller population and having less wealth inequality. This is reflected in the negative slope of the linear fit (y = 204 - 44.2 x). The data is not ideal in the sense that income inequality sustained over a long period likely matters more for nutritional effects on the population, whereas the Wikipedia data is only a snapshot of both height and income inequality.
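Evaluating the quoted fit at the outliers' Gini values makes the trend the negative slope encodes concrete:

```
def fitted_height(gini, intercept=204.0, slope=-44.2):
    # the post's linear fit: y = 204 - 44.2 x
    return intercept + slope * gini

# higher inequality predicts a shorter average height
for g in (0.547, 0.762, 0.808):
    print(round(fitted_height(g), 1))
```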

The data for females shows a similar trend. The outliers in the data set are the following.

Country : Gini Coefficient : Avg Height (cm)

Japan : 0.547 : 158

Bolivia : 0.762 : 142.2

Denmark : 0.808 : 168.7

Looking at the data, we observe a fair amount of variance, which is not ideal but is typical of real-world data. There is enough evidence to lightly suggest that the commonly believed relationship between wealth inequality and height might have something to it, though data averaged over a longer time span would be very useful if available.

-j

As observed in the last equation, power laws look linear on log-log axes. Many real-world systems follow a power law; a famous example is the Gutenberg-Richter law for earthquakes. The following plot shows a histogram of earthquake data from 1970 to 2014 gathered by the US Geological Survey. There are deviations, as seen in the initial roll-off of small-magnitude events, but one observes power-law scaling for events larger than 5.0 on the Richter scale. In this case the exponent tau is measured to be around one.
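The scaling behavior can be reproduced with synthetic data: draw samples from a power law by inverse-transform sampling and fit the slope of the log-log histogram. With log-spaced bins the count in each bin scales as x^(-tau), so the fitted slope should come out near -1 for tau = 1 (the data here is synthetic, not the USGS catalog):

```
import numpy as np

rng = np.random.default_rng(1)

# draw samples from p(x) ~ x^(-(tau + 1)) with tau = 1 via
# inverse-transform sampling of the Pareto CDF
tau = 1.0
xmin = 1.0
u = rng.uniform(size=200_000)
samples = xmin * u ** (-1.0 / tau)

# histogram with log-spaced bins; with these bins the per-bin count
# scales as x^(-tau), so the log-log slope estimates -tau
counts, edges = np.histogram(samples, bins=np.logspace(0, 3, 30))
centers = np.sqrt(edges[:-1] * edges[1:])
mask = counts > 0
slope, _ = np.polyfit(np.log10(centers[mask]), np.log10(counts[mask]), 1)
print(slope)
```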

What is interesting is how commonly such power laws turn up in unexpected places. Reddit uses a scoring system where users up-vote and down-vote each comment on a post. The following is a screenshot of Reddit's comment system.

The scores in this screenshot are 1528, 319, 137, 87, 58, 47, 149, 135, 8, 57, …. One can calculate the change from one score to the next; for example, 319/1528 is approximately 0.21. This process can be continued for the comments in as many top posts as possible. In this case, the first 5 pages of Reddit are scraped and processed on a daily basis. So what is the result of counting the frequency of these measurements: how often did the score grow by 300% or 10000% from one comment to the next? Making a histogram, as was done for the earthquake data, yields the following plot. In this case the exponent tau is measured to be around one, similar to the earthquake data.
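The ratio calculation on the scores quoted above can be sketched directly:

```
# the comment scores read off the screenshot
scores = [1528, 319, 137, 87, 58, 47, 149, 135, 8, 57]

# ratio of each score to the one before it, e.g. 319/1528 ~ 0.21
ratios = [b / a for a, b in zip(scores, scores[1:])]
print(round(ratios[0], 2))
```

Accumulating these ratios over many posts gives the measurements that are histogrammed below.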

From these examples, we observe that the frequency drop from a Richter magnitude 5.0 earthquake to a 6.0 mirrors the frequency drop from a factor-of-10 score change between successive Reddit comments to a factor-of-100 change.
