The task is to query the system each day and store the results. The goal is to gather enough data to run through a Hadoop/Spark pipeline and perform text analysis. Many of the postings, especially those from agencies, are duplicates. I want to see how well one could match job postings using Spark's machine learning clustering.
Since the API also returns lat and lon, there is an opportunity to do some spatial analysis.
The API for the job site lets you filter by keyword, state, city, date, and employer or employer/agency.
You can also limit how much data is returned on each call. Prefixing a keyword with '-' excludes listings that include that word.
At this stage I want to ignore jobs from agencies and contract jobs, because many agencies post the same job repeatedly and many ignore the location, i.e. they post jobs in MA that are actually located in CA.
For the second part of this experiment I will pick up all jobs and try to use classification to identify similar jobs.
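As a concrete illustration of the exclusion syntax, a keyword value might look like the line below; the exact terms are hypothetical, not the ones I actually ran:

```python
# Hypothetical keyword value: the '-' prefix excludes listings
# containing that word, so this asks for java postings while
# dropping any that mention "contract".
keyword = 'java -contract'
```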
I define several lists:
1. The states to check
2. The language keywords to use
3. Skill set keywords
states = ['ME', 'MA', 'TN', 'WA', 'NC', 'AZ', 'CA']
languageset = ['java', 'ruby', 'php', 'c', 'c++', 'clojure', 'javascript', 'c#', '.net']
skillset = ['architect', 'team lead']
The API expects a parameter dictionary to be passed in. Besides using the '-' to exclude keywords, I set 'as_not' to two agencies that I know to ignore, and 'sr' and 'st' to try to avoid contract jobs and agencies. The default dictionary is:
params = {
    'as_not': 'ATSG+Cybercoders',
    'sr': 'directhire',
    'st': 'employer',
    'l': 'ma',
    'filter': '1',
    'format': 'json',
    'fromage': 'last',
    'start': '0',
    'limit': '100000',
    'latlong': '1',
    'psf': 'dvsrch',
    'userip': 'xxx.xxx.xxx.xxx',
    'useragent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2)'
}
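To make the querying step concrete, here is a minimal sketch of a query() helper built around this dictionary. It assumes the `requests` library; the endpoint URL, the keyword parameter name 'q', and the 'results' key in the JSON response are all assumptions, since I am not naming the actual API here:

```python
import requests

API_URL = 'https://api.example-jobsite.com/search'  # hypothetical endpoint

def query(state, keyword):
    """Fetch one batch of postings for a state/keyword pair."""
    p = dict(params)      # copy the default dictionary above
    p['l'] = state        # location filter
    p['q'] = keyword      # keyword filter ('q' is an assumed name)
    resp = requests.get(API_URL, params=p)
    resp.raise_for_status()
    return resp.json().get('results', [])
```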
Each day (yes, I should set it up to run via cron) I run the script and convert the returned data into a Result object.
from collections import namedtuple

Result = namedtuple('Result', 'name company location jobtitle city state country source date latitude longitude jobkey expired sponsored snippet url meta')
It is probably overkill and I could skip this step and go straight to the database, but I really like the compactness of the object and it makes the code look cleaner.
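With Result rendered as a namedtuple (one compact way to hold the fields above), converting the raw JSON becomes a one-liner. The assumption that the API's field names line up with the Result fields is mine:

```python
def to_result(item):
    """Map one JSON posting onto a Result; missing keys become None."""
    return Result(**{f: item.get(f) for f in Result._fields})
```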
for each language in languageset:
    for each state in states:
        query()
        convert the data to a Result object
        get the URL of the main posting
        fetch the page
        use BeautifulSoup to parse the HTML
        extract the content section of the page
        store the result in the Neo4j database:
            use graph.merge_one() to create the State, City and Language nodes
            create a new Job node (the jobkey comes from the API and carries a unique constraint to avoid adding the same posting twice)
            set the Job properties (lat, lon, url, snippet, content, posted date, poll date)
            create the relationships:
                Relationship(job, "IN_STATE", state)
                Relationship(job, "IN_CITY", city)
                Relationship(job, "LANGUAGE", lang)
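Putting the loop together, here is a minimal sketch of the fetch-parse-store pipeline. It assumes py2neo 2.x (where graph.merge_one() lives), a local Neo4j instance, and a made-up selector for the content section of the posting page; query() and to_result() are the helpers sketched above. The unique constraint on jobkey would be created once up front, e.g. with graph.schema.create_uniqueness_constraint('Job', 'jobkey').

```python
import requests
from bs4 import BeautifulSoup
from py2neo import Graph, Node, Relationship

graph = Graph()  # assumes a local Neo4j with default credentials

def fetch_content(url):
    """Pull the full posting page and keep just the text of its
    content section; the div id used here is a made-up example."""
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    section = soup.find('div', id='job_summary')  # hypothetical selector
    return section.get_text(' ', strip=True) if section else ''

def store(result, lang, content):
    """Merge the lookup nodes, create the Job node, and link them."""
    state = graph.merge_one('State', 'name', result.state)
    city = graph.merge_one('City', 'name', result.city)
    language = graph.merge_one('Language', 'name', lang)
    job = Node('Job', jobkey=result.jobkey, name=result.jobtitle,
               lat=result.latitude, lon=result.longitude, url=result.url,
               snippet=result.snippet, content=content, date=result.date)
    graph.create(job,
                 Relationship(job, 'IN_STATE', state),
                 Relationship(job, 'IN_CITY', city),
                 Relationship(job, 'LANGUAGE', language))

for lang in languageset:
    for state in states:
        for item in query(state, lang):
            result = to_result(item)
            # skip postings already stored under this jobkey
            if graph.find_one('Job', 'jobkey', result.jobkey) is None:
                store(result, lang, fetch_content(result.url))
```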
That is the code. After a few false starts I have been able to get it running and gather about 16k listings.
Results
Below are some of the query results. I used Cypher’s RegEx to match on part of the job title.
match (j:Job)--(state:State {name:'NH'}) where j.name =~ 'Java Developer.*' return j, state
match (j:Job)--(state:State {name:'NH'}) where j.name =~ 'Chief.*' return j, state
- Java Developer in New Hampshire
- Java Developer in Maine and New Hampshire
- Chief Technology in New Hampshire
- PHP Developer in all the states polled
One of the goals is to run Spark's machine learning library against the data. As a first test I will count the words in the job titles. In order to determine whether the process is working, I counted the words in the job titles for New Hampshire, so now I have something to compare to after the Spark processing.
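The baseline count itself does not need Spark. A minimal sketch, assuming the py2neo graph object from the storage step, that tallies title words straight out of the graph:

```python
from collections import Counter

# Tally the words across all NH job titles stored in the graph.
rows = graph.cypher.execute(
    "match (j:Job)--(s:State {name:'NH'}) return j.name")
counts = Counter(word for row in rows for word in row[0].lower().split())
for word, n in sorted(counts.items()):
    print(word, n)
```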
Below is a sample of the word count for all jobs polled in New Hampshire:
word | count |
---|---|
analyst | 14 |
application | 10 |
applications | 7 |
architect | 15 |
architecture | 4 |
associate | 3 |
automation | 5 |
business | 2 |
chief | 2 |
cloud | 4 |
commercial | 3 |
communications | 4 |
computer | 2 |
consultant | 4 |
database | 4 |
designer | 3 |
developer | 59 |
development | 15 |
devices | 3 |
devops | 3 |
diem | 3 |
electronic | 2 |
embedded | 5 |
engineer | 83 |
engineering | 2 |
engineers | 2 |
integration | 4 |
java | 31 |
junior | 3 |
linux | 3 |
management | 4 |
manager | 8 |
mobile | 3 |
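The Spark version of that first test is the classic word count. A sketch, assuming the titles have been exported one per line to a file Spark can read (the path here is made up):

```python
from operator import add
from pyspark import SparkContext

sc = SparkContext(appName='TitleWordCount')

# Classic word count over the exported job titles.
counts = (sc.textFile('hdfs:///jobs/nh_titles.txt')   # hypothetical path
            .flatMap(lambda title: title.lower().split())
            .map(lambda word: (word, 1))
            .reduceByKey(add)
            .sortByKey())

for word, n in counts.collect():
    print(word, n)
```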
I have Hadoop and Spark running. I need to get Mazerunner installed and run a few tests. Then the fun begins…