The task of using text data to make complex decisions has become more important as text data has become more widely available, helped by public APIs such as the Twitter REST API. This post focuses on support vector machines (SVMs) for this task. SVMs have a long history in text classification and offer advantages such as typically needing less time to train than neural networks.
The goal is to classify online job listings such as the one at http://www.simplyhired.com/job/survivability-airframe-integration-engineer-job/northrop-grumman/gryhetooim , whose text is reproduced below.
We want to design a system that takes this text as input and classifies it as a defense contractor engineering job listing.
The survivability integration engineer will utilize 1D, 2D, and 3D computational electromagnetic codes such as method of moments (MoM) and physical optics (PO) to perform component and vehicle level design and analysis and ensure system survivability. Engineer will perform survivability testing and data post-processing using Knowbell or Pioneer to verify survivability design of antennas, propulsion systems, and vehicle features such as edges, wing folds, gaps and door seals. Engineer will perform testing at both indoor and outdoor testing facilities. As an Engineer 3, the engineer will perform his/her tasks under the guidance of a senior engineer with general instruction. The survivability integration engineer will be a part of an integrated product team (IPT), supporting multi-discipline advanced development programs and technology research and development (R&D). Qualifications Basic Qualifications: A minimum of a bachelor's degree in electrical, mechanical, or aerospace engineering, or physics, and five years of relevant work experience is required for this position and a minimum of 2 years of applicable experience in Applied Electromagnetics or RF survivability design. Candidates with a masters degree in any of the areas listed above are required to have at least three years of experience. Candidates with a PhD in the areas listed above having no industry experience may apply for this position if their research or post-doctoral work is applicable. The ability to obtain and maintain a Department of Defense top secret clearance and special access program clearances is required for this position. Preferred Qualifications: Preference will be given to candidates with degrees in electrical engineering. Coursework in applied electro-physics, finite element analysis, or aircraft design is desirable. Candidates with experience in signature integration, RF survivability design, analysis, and testing are highly desired. 
Experience with Method of Moments codes such as HFSS or COMSOL, and shooting and bouncing ray codes such as Xpatch are highly desired. Experience with RCS data processing tools, such as Knowbell or Pioneer are highly desired. Computer-aided drafting (CAD) skills with, such as Catia or UG are a plus. Preference will be given to candidates with excellent verbal and written communication skills. An active secret or top clearance is highly preferred.
Let's now begin with a quick introduction to support vector machines.
A minimal introduction to support vector machines
Support vector machines perform classification by drawing a decision boundary: a line in two dimensions or a hyperplane in higher dimensions. This decision boundary determines the predicted class of the input data. In the preceding figure the prediction would determine whether the input data should be classified as a circle or a square. The key question is where to draw this line. In support vector machines the idea is to draw the line so that the empty white space between the two classes is as large as possible while still dividing the data. This empty space is called the "margin", and the goal is to maximize it.
We can use linear algebra to show that the width of the margin is 2/||w||, where w is a vector orthogonal (perpendicular) to the decision boundary. The predicted class is given by the algebraic sign of the following function, in which w * x denotes the dot product and b is an offset:
f(x) = w * x - b
Training now becomes an optimization problem: find the w and b that maximize 2/||w|| while keeping the training points of each class on the correct side of the margin. Optimization is a large field in its own right, so we will skip the details of how this problem is solved.
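As a small illustration of the two formulas above, the following sketch evaluates f(x) = w * x - b and the margin width 2/||w|| for a hand-picked boundary. The values of w and b are made up for the example and the function names are hypothetical; in practice w and b come out of the optimization step.

```python
import math


def decision_value(w, x, b):
    """Return f(x) = w . x - b for vectors given as lists of floats."""
    return sum(wi * xi for wi, xi in zip(w, x)) - b


def predict(w, x, b):
    """Predict +1 or -1 from the sign of the decision function."""
    return 1 if decision_value(w, x, b) >= 0 else -1


def margin_width(w):
    """Return the margin width 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


# Hypothetical boundary in two dimensions (not learned from data).
w = [3.0, 4.0]
b = 5.0

print(margin_width(w))           # 0.4, since ||w|| = 5
print(predict(w, [3.0, 4.0], b)) # 1, since f(x) = 20
print(predict(w, [0.0, 0.0], b)) # -1, since f(x) = -5
```

A longer w gives a narrower margin, which is why maximizing 2/||w|| amounts to finding the shortest w that still separates the data.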
Obtaining a training set
Building a model requires input data, but no open data set was readily available, so the data had to be scraped from online sources and preprocessed with some basic natural language processing (NLP) techniques such as removing stop words (http://en.wikipedia.org/wiki/Stop_words). To do this, a web scraper was written in Python using libraries such as urllib, beautifulsoup4, and nltk. The data was scraped from the results of searches on the job listings site SimplyHired. The search terms were the occupations, which also serve as the target class labels.
The result was a single line of text like the following, which came from a result of an accounting specialist search on SimplyHired.
We currently great opportunity finance industry Experis seeking Accounting responsible wide range input processing core accounting system Great Essential Duties Responsibilities Assist processing lease bookings lease end lease processing cash receipts lease discounting etc Process periodic billing daily cash receipts Respond wide range requests documentation group sales staff customers Required Knowledge Skills Abilities Associate degree Accounting plus years experience solid general accounting knowledge leasing industry experience Experience Crystal Reports plus Proficient Excel Word Ability learn specific software applications Attention detail strong organizational skills Ability handle wide variety tasks Ability interface customers project professional image times Flexibility handle uneven nature work load deadlines Experis Who We Are As leader project solutions professional talent resourcing contract permanent positions Experis matches professionals rewarding Finance IT Engineering opportunities industry leading organizations helping accelerate careers delivering excellent results client companies Experis part world leader innovative workforce solutions Experis Benefits Expand connections Grow experience skills Accelerate career And even work us Experis professionals earn comprehensive benefits industry Along competitive pay benefits may include dental vision disability life insurance flexible spending accounts employee stock purchase plan holiday pay thousands online training courses develop current skills explore something new interviewing consulting advice job search tips helpful tools Learn Experis jobs Experis Equal Opportunity Employer
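The stop word removal that produces a line like the one above can be sketched roughly as follows. This is not the original scraper: the stop word set is a tiny hand-picked stand-in for nltk's much longer English list, the sample sentence is invented, and `clean_listing` is a hypothetical helper name.

```python
import re

# Assumption: a tiny illustrative stop word list. The scraper described in
# this post used nltk, whose English stop word list is much longer.
STOPWORDS = {
    "a", "an", "and", "are", "for", "have", "in",
    "is", "of", "the", "to", "we", "with",
}


def clean_listing(text):
    """Keep alphabetic tokens that are not stop words, joined on one line."""
    tokens = re.findall(r"[A-Za-z]+", text)  # drops digits and punctuation
    kept = [t for t in tokens if t.lower() not in STOPWORDS]
    return " ".join(kept)


# Invented sample sentence, used only to show the effect of the filter.
sample = "We currently have a great opportunity in the finance industry."
print(clean_listing(sample))  # currently great opportunity finance industry
```

This version compares tokens in lower case, so capitalized stop words such as "We" are removed as well; a case-sensitive comparison would leave them in.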
Training the support vector machine model
The first step in modeling the text is to vectorize it, which maps each document into a multidimensional space where hyperplanes can be used to predict its occupation label. A common approach, and the one used here, is tf-idf (http://en.wikipedia.org/wiki/Tf%E2%80%93idf). Because of the nature of text these vectors are sparse, which makes classification techniques better suited to dense vector data, such as random forest classification, less practical. The occupation labels are mapped to integers, with a dictionary that maps each integer key back to the actual text of the occupation.
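A minimal sketch of this vectorization step, assuming scikit-learn's TfidfVectorizer with its default settings (the post does not say which settings were used). The three documents and the two occupation names are toy stand-ins for the scraped listings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for preprocessed listings and their occupation labels.
docs = [
    "survivability engineer electromagnetic antennas clearance",
    "accounting lease billing cash receipts excel",
    "engineer propulsion testing clearance design",
]
occupations = ["defense engineer", "accounting specialist", "defense engineer"]

# Dictionary from integer key to occupation text, plus the reverse lookup
# used to encode the training labels as integers.
label_names = dict(enumerate(sorted(set(occupations))))
name_to_id = {name: key for key, name in label_names.items()}
y = [name_to_id[name] for name in occupations]

# Each document becomes one row of a sparse tf-idf matrix.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

print(label_names)     # integer key -> occupation text
print(y)               # integer label for each document
print(X.shape, X.nnz)  # (documents, vocabulary size), stored non-zeros
```

Only the non-zero entries are stored, which is what keeps this representation manageable once the vocabulary grows to many thousands of words.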
The vectorized input data can then be modeled with support vector machines as desired. This was implemented using Python's scikit-learn library. The result is a model that takes an input job listing and classifies it into an occupation category. However, feeding data to the model directly requires knowledge of Python I/O, so a simple website is needed to demo the model by letting a user enter a webpage URL to classify.
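The training step might look something like the sketch below. It assumes scikit-learn's LinearSVC chained after a TfidfVectorizer in a pipeline; the post does not say which SVM class or parameters were used, and the four training texts are invented toy data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Integer key -> occupation text, as described above.
label_names = {0: "accounting specialist", 1: "defense engineer"}

# Invented toy training data standing in for the scraped listings.
train_texts = [
    "accounting lease billing cash receipts",
    "accounting excel reports billing invoices",
    "engineer antennas clearance survivability testing",
    "engineer propulsion clearance design testing",
]
train_labels = [0, 0, 1, 1]

# Assumption: a linear SVM with default parameters on tf-idf features.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train_texts, train_labels)


def classify(text):
    """Return the occupation text predicted for one preprocessed listing."""
    return label_names[int(model.predict([text])[0])]


print(classify("billing cash receipts accounting"))
print(classify("engineer clearance antennas testing"))
```

Keeping the vectorizer and the classifier in one pipeline means new text is transformed with the same vocabulary that was learned during training.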
The Django webapp
To make it easy to enter webpage URLs, a Django webapp will be deployed. The Django app interfaces with the underlying model by fetching the input URL and preprocessing its contents into a form that the SVM model can use to make a prediction. The prediction is then displayed back to the user. The following is a screenshot from the resulting webapp.
The Django web app can then be deployed on an Amazon EC2 instance to demo the underlying model.
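The fetch, preprocess, and predict flow that the web app wraps can be sketched without Django, using only the standard library. Everything here is an assumption made for illustration: the helper names (`html_to_text`, `fetch_html`, `classify_url`) are hypothetical, and the `predict` argument stands in for the trained model. A Django view would call something like `classify_url` with the submitted URL and render the result.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Reduce an HTML page to a single line of visible text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def fetch_html(url):
    """Download a page and decode it, replacing undecodable bytes."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def classify_url(url, predict, fetch=fetch_html):
    """Fetch a URL, strip the markup, and hand the text to `predict`."""
    return predict(html_to_text(fetch(url)))


# Usage with a fake fetcher and a stand-in predictor, so nothing is downloaded.
def fake_fetch(url):
    return ("<html><head><style>p{}</style></head><body>"
            "<h1>Accounting Specialist</h1><script>var x = 1;</script>"
            "<p>Process billing</p></body></html>")


def fake_predict(text):
    return "accounting specialist" if "billing" in text else "unknown"


print(classify_url("http://example.com/job", fake_predict, fetch=fake_fetch))
```

Passing the fetcher and predictor in as arguments keeps the flow easy to test without network access or a trained model.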
This approach has weaknesses, which can be discussed at a later time; one of them is that the hierarchical nature of occupations is not captured by the model at all. Still, SVMs have an advantage in ease of use and performance, especially as a first solution to this text mining problem.