Back to the jobs data…

The task is to query the system each day and store the results. The goal is to gather enough data to run through a Hadoop/Spark pipeline and perform text analysis. Many of the postings, especially from agencies, are duplicates. I want to see how well one could match job postings using Spark's machine learning clustering.

Since the API also returns latitude and longitude, there is an opportunity to do some spatial analysis as well.

The API for the job site lets you filter by keyword, state, city, date, and employer or employer/agency. You can also limit the amount of data returned per call. Prefixing a keyword with '-' excludes listings that include that word. At this stage I want to ignore jobs from agencies and contract jobs, because many agencies post the same job repeatedly and many ignore the location, i.e. they post jobs in MA that are actually located in CA.
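For example, assuming the keyword parameter is named 'q' (an assumption on my part; check the site's API docs for the exact field), a query value like this drops some of the agency noise:

q = 'java -contract -recruiter'   # the '-' prefix excludes listings containing that word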

For the second part of this experiment I will change this to pick up all jobs and try to use classification to identify similar jobs.

I define several lists:
1. The states to check
2. The language keywords to use
3. Skill set keywords

states = ['ME', 'MA', 'TN', 'WA', 'NC', 'AZ', 'CA']
languageset = ['java', 'ruby', 'php', 'c', 'c++', 'clojure', 'javascript', 'c#', '.net']
skillset = ['architect', 'team lead']

The API expects a parameter dictionary to be passed in. Besides using '-' to exclude keywords, I set 'as_not' to two agencies that I already know to ignore. 'sr' and 'st' are set to try to avoid contract jobs and agencies. The default dictionary is:

params = {
    'as_not': 'ATSG+Cybercoders',
    'sr': 'directhire',
    'st': 'employer',
    'l': 'ma',
    'filter': '1',
    'format': 'json',
    'fromage': 'last',
    'start': '0',
    'limit': '100000',
    'latlong': '1',
    'psf': 'dvsrch',
    'userip': 'xxx.xxx.xxx.xxx',
    'useragent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2)'
}
Each day I run it by hand (yes, I should set it up to run via cron).
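The query step itself is just an HTTP GET. Here is a minimal sketch, assuming a REST endpoint that returns JSON with a 'results' list (the endpoint URL, the 'q' keyword parameter, and the response shape are all assumptions here):

import requests

API_URL = 'https://api.example.com/jobsearch'   # hypothetical endpoint; the real URL comes from the job site's API docs

def query(state, language):
    p = dict(params)           # start from the default dictionary above
    p['l'] = state             # override the location filter
    p['q'] = language          # keyword filter; 'q' is an assumed parameter name
    resp = requests.get(API_URL, params=p)
    resp.raise_for_status()
    return resp.json().get('results', [])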

I convert the data into a Result object.

from collections import namedtuple

# A lightweight container for one job posting returned by the API.
Result = namedtuple('Result', ['name', 'company', 'location', 'jobtitle', 'city', 'state', 'country', 'source', 'date', 'latitude', 'longitude', 'jobkey', 'expired', 'sponsored', 'snippet', 'url', 'meta'])

It is probably overkill and I could simply skip this step and go straight to the database, but I really like the compactness of the object and it makes the code look cleaner.

for each language in languageset
    for each state in states
        query()
        convert the data to a Result object
        get the URL of the main posting
        fetch the page
        use BeautifulSoup to parse the HTML
        get the content section of the page
        store the result in the Neo4j database
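Put together in Python, that pseudocode might look like the sketch below. The JSON field names simply mirror the Result fields and the div id for the posting body is a guess; real pages will differ. add_to_neo4j() is sketched in the next section.

import requests
from bs4 import BeautifulSoup

def fetch_content(url):
    # Fetch the full posting page and pull out the text of its content section.
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    body = soup.find('div', id='job_summary')   # hypothetical id; adjust to the actual page markup
    return body.get_text(strip=True) if body else ''

for language in languageset:
    for state in states:
        for rec in query(state, language):
            result = Result(
                name=rec.get('jobtitle'), company=rec.get('company'),
                location=rec.get('location'), jobtitle=rec.get('jobtitle'),
                city=rec.get('city'), state=rec.get('state'),
                country=rec.get('country'), source=rec.get('source'),
                date=rec.get('date'), latitude=rec.get('latitude'),
                longitude=rec.get('longitude'), jobkey=rec.get('jobkey'),
                expired=rec.get('expired'), sponsored=rec.get('sponsored'),
                snippet=rec.get('snippet'), url=rec.get('url'), meta=rec)
            content = fetch_content(result.url)
            add_to_neo4j(result, content, language)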

To add to Neo4j:
    use graph.merge_one() to create the State, City, and Language nodes
    create a new Job node keyed on jobkey
        (jobkey comes from the API and carries a unique constraint, so the same posting is never added twice)
    set the Job properties (lat, lon, url, snippet, content, posted date, poll date)
    create the relationships:
        Relationship(job, "IN_STATE", state)
        Relationship(job, "IN_CITY", city)
        Relationship(job, "LANGUAGE", lang)
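With py2neo (the v2-era API, where merge_one() exists), that could be sketched as follows; the property names on the Job node are my own choices:

from py2neo import Graph, Node, Relationship

graph = Graph()   # defaults to http://localhost:7474/db/data/

# One-time setup: the uniqueness constraint makes a second insert of the
# same jobkey fail instead of creating a duplicate node.
graph.schema.create_uniqueness_constraint('Job', 'jobkey')

def add_to_neo4j(result, content, language):
    # merge_one() returns the matching node if it exists, otherwise creates it.
    state = graph.merge_one('State', 'name', result.state)
    city = graph.merge_one('City', 'name', result.city)
    lang = graph.merge_one('Language', 'name', language)
    job = Node('Job', jobkey=result.jobkey, name=result.jobtitle,
               lat=result.latitude, lon=result.longitude, url=result.url,
               snippet=result.snippet, content=content, date=result.date)
    graph.create(job,
                 Relationship(job, 'IN_STATE', state),
                 Relationship(job, 'IN_CITY', city),
                 Relationship(job, 'LANGUAGE', lang))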

That is the code. After a few false starts I have been able to get it running and gather about 16k listings.

Results

Below are some of the query results. I used Cypher’s RegEx to match on part of the job title.

MATCH (j:Job)--(state:State {name:'NH'}) WHERE j.name =~ 'Java Developer.*' RETURN j, state
MATCH (j:Job)--(state:State {name:'NH'}) WHERE j.name =~ 'Chief.*' RETURN j, state

  • Java Developer in New Hampshire
  • Java Developer in Maine and New Hampshire
  • Chief Technology in New Hampshire
  • PHP Developer in all the states polled

One of the goals is to run Spark's machine learning library against the data. As a first test I will count the words in the job titles. In order to determine whether the process is working, I counted the words in the job titles for New Hampshire, so now I have something to compare against after the Spark processing.
Below is a sample of the word count for all jobs polled in New Hampshire:

word count
analyst 14
application 10
applications 7
architect 15
architecture 4
associate 3
automation 5
business 2
chief 2
cloud 4
commercial 3
communications 4
computer 2
consultant 4
database 4
designer 3
developer 59
development 15
devices 3
devops 3
diem 3
electronic 2
embedded 5
engineer 83
engineering 2
engineers 2
integration 4
java 31
junior 3
linux 3
management 4
manager 8
mobile 3
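A minimal sketch of producing that baseline straight from the graph, using py2neo v2's graph.cypher.execute() and a naive lowercase split for tokenization:

from collections import Counter

# Pull every job title attached to New Hampshire and tally the words.
records = graph.cypher.execute(
    "MATCH (j:Job)--(s:State {name:'NH'}) RETURN j.name AS title")
counts = Counter(word
                 for record in records
                 for word in record.title.lower().split())
for word, n in sorted(counts.items()):
    print(word, n)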

I have Hadoop and Spark running. I need to get Mazerunner installed and run a few tests. Then the fun begins…
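For reference, the Spark side of that first test is the classic word count. A sketch in PySpark, assuming the titles have been exported to a text file with one title per line (the file name is hypothetical):

from pyspark import SparkContext

sc = SparkContext(appName='titleWordCount')

counts = (sc.textFile('nh_titles.txt')          # hypothetical export of job titles
            .flatMap(lambda line: line.lower().split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))

for word, n in counts.collect():
    print(word, n)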
