back to the jobs data..

The task is to query the system each day and store the results. The goal is to have enough data to run through a Hadoop/Spark process and perform text analysis. Many of the postings, especially those from agencies, are duplicates. I want to see how well one could match job postings using Spark machine learning clustering.

Since the API also returns lat and lon, there is an opportunity to do some spatial analysis.

The API for the job site lets you filter by keyword, state, city, date and employer or employer/agency.
You can also limit the data returned each time. Prefixing a keyword with ‘-’ will ignore listings that include that word.
At this stage I want to ignore jobs from agencies and contract jobs. This is because many agencies post the same job and many ignore the location, i.e. they post jobs in MA that are located in CA.
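As a rough illustration of the keyword exclusion described above, here is a minimal sketch that builds a query string. The function name build_query and the choice to join terms with spaces are my assumptions; only the '-' prefix comes from the API description.

```python
def build_query(keyword, excluded=()):
    """Return a query string asking for `keyword` and excluding other words.

    Assumption: terms are separated by spaces. Only the '-' prefix is
    taken from the description of the job site API.
    """
    terms = [keyword] + ["-" + word for word in excluded]
    return " ".join(terms)


# Hypothetical example: Java jobs, skipping contract and agency listings.
print(build_query("java", ["contract", "agency"]))
```

The excluded words here are only examples; the real list would depend on what shows up in the listings.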

For the second part of this experiment I will change this to pick up all jobs and try to use classification to identify similar jobs.

I define several lists:
1. The states to check
2. The language keywords to use
3. Skill set keywords

states = ['ME', 'MA', 'TN', 'WA', 'NC', 'AZ', 'CA']
languageset = ['java', 'ruby', 'php', 'c', 'c++', 'clojure', 'javascript', 'c#', '.net']
skillset = ['architect', 'team lead']

The API expects a parameter dictionary to be passed in. Besides using the “-” to ignore keywords, I am setting “as_not” to two agencies that I know I want to ignore. “sr” and “st” are set to try to avoid contract jobs and agencies.

The default dictionary is:

params = {
    'as_not': "ATSG+Cybercoders",
    'l': "ma",
    'limit': "100000",
    'start': "0",
    'userip': "",
    'useragent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2)"
}

Each day I run it (yes, I should set it up with cron).

I convert the data into a Result object.

class Result:
name,company,location,jobtitle,city,state,country,source,date,latitude,longitude,jobkey, expired,sponsored ,snippet,url,meta

It is probably overkill and I could simply skip this step and go right to the database. I really like the compactness of the object and it makes the code look cleaner.
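To show what such a compact object could look like, here is a sketch using a dataclass. It carries only a subset of the fields listed above, and the field types, the from_api helper and the sample values are all assumptions made for illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class Result:
    # A subset of the fields named above; the types are assumptions.
    jobkey: str
    jobtitle: str
    company: str
    city: str
    state: str
    latitude: float
    longitude: float
    url: str

    @classmethod
    def from_api(cls, item):
        """Build a Result from one API record (a dict), ignoring extra keys."""
        names = [f.name for f in fields(cls)]
        return cls(**{name: item.get(name) for name in names})


# Made-up record, only to show the conversion.
sample = {
    "jobkey": "abc123", "jobtitle": "Java Developer", "company": "Acme",
    "city": "Portland", "state": "ME", "latitude": 43.66,
    "longitude": -70.26, "url": "http://example.com/job/abc123",
    "expired": False,
}
job = Result.from_api(sample)
print(job.jobtitle, job.state)
```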

for each languageset
    for each state
        convert the data to a result object
        get the url to the main posting
        get the page
        use BeautifulSoup to parse the html
        get the content section of the page
        store the result in Neo4j database

add to Neo4j
    use graph.merge_one() to create state, city and language nodes

    create new Job node(jobkey)
        job key is from the api and has a unique constraint to avoid adding the same one again.
    set Job properties(lat,lon,url,snippet.content, posted date, poll data)
    create relationships
        Relationship(job, "IN_STATE", state)
        Relationship(job, "IN_CITY", city)
        Relationship(job, "LANGUAGE", lang)

That is the code. After a few false starts I was able to get it to run and gather about 16k listings.


Below are some of the query results. I used Cypher’s RegEx to match on part of the job title.

match (j:Job)--(state:State{name:'NH'}) where =~ 'Java Developer.*' return j,state
match (j:Job)--(state:State{name:'NH'}) where =~ 'Chief.*' return j,state

  • Java Developer in New Hampshire
  • Java Developer in Maine and New Hampshire
  • Chief Technology in New Hampshire
  • PHP Developer in all the states polled

[Figure: Java Developer in New Hampshire and Maine]

[Figure: Java Developer in New Hampshire]

[Figure: Chief Technology in New Hampshire]

[Figure: PHP Developer in all states]

One of the goals is to run Spark’s machine learning library against the data. As a first test I will count the words in the job titles. To check that the process is working, I counted the words in job titles for New Hampshire ahead of time. Now I have something to compare against after the Spark processing.
Below is a sample of the word count for all jobs polled in New Hampshire.

word count
analyst 14
application 10
applications 7
architect 15
architecture 4
associate 3
automation 5
business 2
chief 2
cloud 4
commercial 3
communications 4
computer 2
consultant 4
database 4
designer 3
developer 59
development 15
devices 3
devops 3
diem 3
electronic 2
embedded 5
engineer 83
engineering 2
engineers 2
integration 4
java 31
junior 3
linux 3
management 4
manager 8
mobile 3
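A count like the one above can be produced in plain Python before bringing Spark into it. This is only a sketch: the titles are made up, and the tokenizing rule (lowercase, keeping characters such as '+', '#' and '.' so that "c++" or ".net" survive) is my assumption.

```python
import re
from collections import Counter


def count_title_words(titles):
    """Count lowercase words across a list of job titles."""
    counts = Counter()
    for title in titles:
        counts.update(re.findall(r"[a-z0-9#+.]+", title.lower()))
    return counts


# Made-up titles, just to exercise the function.
titles = ["Java Developer", "Senior Java Engineer", "Software Engineer"]
counts = count_title_words(titles)
for word, n in sorted(counts.items()):
    print(word, n)
```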

I have Hadoop and Spark running. I need to get mazerunner installed and run a few tests. Then the fun begins…

Posted in Uncategorized | Leave a comment

Machine learning Scrum Sprint estimates

Another idea as part of my “A year with data” exploration.

Anyone who has worked in a Scrum/Agile environment understands the pain involved with task estimation. Some of the methods include shirt sizes (S, M, L), the Fibonacci sequence (1, 2, 3, 5, 8, ...), powers of 2 (1, 2, 4, 8) and even poker. Then there is the process of dealing with disparate estimates. One person gives an estimate of 2 and another suggests it's a 13. After some discussion it's agreed that the task is an 8. At the end of the sprint maybe it turns out that the task really was a 5. It would be useful, and interesting, to determine how well people do in their estimation. Is person A always underestimating? Is person B mostly spot on?

This seems like a good candidate for Machine Learning, supervised learning to be more specific. I am not sure how many teams capture  information from the estimation process but they should.

Basic information such as:

  • The original estimates for each team member
  • The final agreed upon estimates
  • The actual size of the task once completed

The data might look like this:

Task TM1 TM2 TM3 TM4 TM5 TM6 TM7 TM8 TM9 Actual
1 1 8 1 13 5 8 2 5 13 8
2 3 8 5 8 8 5 3 1 8 5
3 2 5 5 5 5 2 1 8 1 3
4 8 5 6 3 1 2 2 13 5 5
5 3 5 5 8 8 8 8 13 13 13
6 1 3 5 1 1 1 1 2 5 2
7 1 3 5 1 1 5 8 5 3 2
8 5 3 5 3 2 1 1 3 2 1
9 8 8 6 5 8 8 13 3 5 5
10 2 5 5 8 8 8 8 8 8 13

The ‘training’ data consists of ten tasks, the estimates from each of the nine team members and the actual value the task turned out to be. I chose the Fibonacci sequence as the method for estimates. Another piece of information that could be useful is the estimate the team agreed upon. That could be compared to the actual value as well. I decided not to do that since it hides the interesting information in each team member's estimate. By using each team member's input we can determine which ones are contributing more or which ones are further off in their estimates.
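Before any machine learning, a simple check on the table above already hints at who runs low or high. The sketch below computes each member's mean signed error (estimate minus actual); a negative value means the member tends to underestimate. The rows are copied from the table; the function name is my own.

```python
# Rows copied from the table above: nine estimates followed by the actual value.
rows = [
    [1, 8, 1, 13, 5, 8, 2, 5, 13, 8],
    [3, 8, 5, 8, 8, 5, 3, 1, 8, 5],
    [2, 5, 5, 5, 5, 2, 1, 8, 1, 3],
    [8, 5, 6, 3, 1, 2, 2, 13, 5, 5],
    [3, 5, 5, 8, 8, 8, 8, 13, 13, 13],
    [1, 3, 5, 1, 1, 1, 1, 2, 5, 2],
    [1, 3, 5, 1, 1, 5, 8, 5, 3, 2],
    [5, 3, 5, 3, 2, 1, 1, 3, 2, 1],
    [8, 8, 6, 5, 8, 8, 13, 3, 5, 5],
    [2, 5, 5, 8, 8, 8, 8, 8, 8, 13],
]


def mean_signed_error(rows, member):
    """Average of (estimate - actual) for one team member (0-based index)."""
    errors = [row[member] - row[-1] for row in rows]
    return sum(errors) / len(errors)


for member in range(9):
    print("TM%d" % (member + 1), round(mean_signed_error(rows, member), 2))
```

With only ten tasks this says very little, but it gives a baseline to compare the fitted thetas against.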

Gradient descent

I am not going to try to explain gradient descent as there are others much better qualified to do that. I found the Stanford Machine Learning course to be the most useful. The downside is that the course used Octave and I want to use Python. There is a bit of a learning curve in making the change. Hopefully I have this figured out.

The significant equations are below.

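The hypothesis hθ(x(i)) is the predicted value for task i:

$$h_\theta(x^{(i)}) = \theta^T x^{(i)} = \sum_{j=0}^{n} \theta_j x_j^{(i)}$$

Where xj(i) is team member j's estimate for task i.
x(i) is the estimate (feature) vector for task i in the training set.
θT is the transpose of the theta vector.

The cost function J(θ) represents how well theta can predict the outcome. The math looks like this.

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

Here m is the number of tasks and y(i) is the actual value of task i.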



For this I am using the following python packages


Note: In order to use seaborn's residplot I had to install ‘patsy’ and ‘statsmodels’.

easy_install patsy
pip install statsmodels

Import the packages and set up pandas to display correctly
import numpy as np
import pandas as pd
pd.set_option('display.notebook_repr_html', False)

The first step is to read the data.

training_set = pd.read_csv('estimate data.txt')

Next, we need to separate the estimates from the actual values

 tm_estimates = training_set[['tm1','tm2','tm3','tm4','tm5','tm6','tm7','tm8','tm9']] 
        tm1  tm2  tm3  tm4  tm5  tm6  tm7  tm8  tm9
         1    8    1   13    5    8    2    5   13
         3    8    5    8    8    5    3    1    8
         2    5    5    5    5    2    1    8    1
         8    5    6    3    1    2    2   13    5
         3    5    5    8    8    8    8   13   13
         1    3    5    1    1    1    1    2    5
         1    3    5    1    1    5    8    5    3
         5    3    5    3    2    1    1    3    2
         8    8    6    5    8    8   13    3    5
         2    5    5    8    8    8    8    8    8

 actuals = training_set['est']

[Figure: distribution plot of the actuals]


One thing to consider is the normalization of the data. This is important when data values vary greatly. In this case the values are not all that different, but it's worth the effort to add this step.


mean = tm_estimates.mean()
std = tm_estimates.std()

tm_estimates_norm = (tm_estimates - mean) / std
tm1 tm2 tm3 tm4 tm5 tm6 tm7 tm8 tm9
-0.147264 1.312268 0.143019 0.661622 1.031586 0.064851 -0.403064 -1.184304 0.405606
-0.515425 -0.145808 0.143019 -0.132324 0.093781 -0.907909 -0.877258 -1.184304 0.405606
1.693538 -0.145808 0.858116 -0.661622 -1.156627 -0.907909 -0.640161 0.441211 -1.264536
-0.883585 -1.117858 0.143019 -1.190919 -1.156627 -1.232162 -0.877258 1.602294 1.598564
-0.883585 -1.117858 0.143019 -1.190919 -1.156627 0.064851 0.782419 -0.952088 -0.310169
0.589057 -1.117858 0.143019 -0.661622 -0.844025 -1.232162 -0.877258 -0.255438 -0.787353
1.693538 1.312268 0.858116 -0.132324 1.031586 1.037610 1.967903 -0.719871 -1.025944
-0.515425 -0.145808 0.143019 0.661622 1.031586 1.037610 0.782419 0.441211 0.405606
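To make the normalization step concrete, here is the same calculation for the tm1 column alone using only the standard library. The values are the tm1 estimates from the table above. statistics.stdev is the sample standard deviation (dividing by n - 1), which is also what pandas' std() uses by default.

```python
from statistics import mean, stdev

# tm1 estimates, copied from the data above.
tm1 = [1, 3, 2, 8, 3, 1, 1, 5, 8, 2]


def normalize(values):
    """Return z-scores: (value - mean) / sample standard deviation."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (n - 1)
    return [(v - mu) / sigma for v in values]


tm1_norm = normalize(tm1)
print([round(v, 6) for v in tm1_norm])
```

An estimate of 3 comes out as about -0.147264 and an estimate of 8 as about 1.693538, matching the tm1 values in the output above.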

To satisfy the equation we need to add an extra column for theta0. For that we add x0 and set all of the values to 1

# the number of data points
m = len(tm_estimates_norm)
#add the x0 column and set all values to one.
tm_estimates_norm['x0'] = pd.Series(np.ones(m))

Next we define the learning rate alpha to be 0.15. The number of iterations is 150. Setting these two values will control how well the cost function converges.

    alpha = 0.15
    iterations = 150


Set the initial values of theta to zero. Then convert the data into numpy arrays instead of Python structures.

    # Initialize theta values to zero
    thetas = np.zeros(len(tm_estimates_norm.columns))
    tm_estimates_norm = np.array(tm_estimates_norm)
    estimations = np.array(actuals)
    cost_history = []

Now do something!
First calculate the prediction: the dot product of the estimates and theta.
Next update theta using the gradient of J(θ).
Then calculate the cost and record it. This last step will tell us whether the cost is decreasing or not.

    for i in range(iterations):
        # Calculate the predicted values
        predicted = np.dot(tm_estimates_norm, thetas)

        # Update theta using the gradient of the cost
        thetas -= (alpha / m) * np.dot((predicted - estimations), tm_estimates_norm)

        # Calculate cost
        sum_of_square_errors = np.square(predicted - estimations).sum()
        cost = sum_of_square_errors / (2 * m)

        # Append cost to history
        cost_history.append(cost)
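For anyone who wants to see the loop run without pandas or numpy, here is a small self-contained sketch of the same batch gradient descent in plain Python. The toy data (y = 1 + 2x with an x0 column of ones) is made up so the answer is known in advance; it is not the estimate data.

```python
def gradient_descent(features, targets, alpha, iterations):
    """Batch gradient descent for linear regression.

    features: list of rows, each starting with x0 = 1.0
    targets:  list of actual values
    Returns the fitted thetas and the cost recorded at each iteration.
    """
    m = len(features)
    n = len(features[0])
    thetas = [0.0] * n
    cost_history = []
    for _ in range(iterations):
        # Prediction for every row: theta . x
        predicted = [sum(t * x for t, x in zip(thetas, row)) for row in features]
        errors = [p - y for p, y in zip(predicted, targets)]
        # Cost J(theta) = sum of squared errors / (2m)
        cost_history.append(sum(e * e for e in errors) / (2 * m))
        # Simultaneous update: all gradients use the errors computed above
        for j in range(n):
            gradient = sum(e * row[j] for e, row in zip(errors, features)) / m
            thetas[j] -= alpha * gradient
    return thetas, cost_history


# Toy data following y = 1 + 2 * x; the first column is x0 = 1.
features = [[1.0, -1.0], [1.0, 0.0], [1.0, 1.0]]
targets = [-1.0, 1.0, 3.0]
thetas, history = gradient_descent(features, targets, alpha=0.15, iterations=150)
print([round(t, 3) for t in thetas])
print(history[0] > history[-1])
```

If the cost history does not shrink, the learning rate is usually the first thing to look at.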

I tried different combinations of alpha and iterations just to see how this works.

The first attempt uses alpha = 0.75.
[Figure: high learning rate, alpha = 0.75]

This next try uses alpha = 0.05 and iterations = 50.
[Figure: lower learning rate, alpha = 0.05]

This last one uses alpha = 0.15 and iterations = 150.


actuals predictions difference
8 7.923283 -0.076717
5 5.461475 0.461475
3 3.481465 0.481465
5 4.404572 -0.595428
13 14.287301 1.287301
2 1.225380 -0.774620
2 2.737848 0.737848
1 1.467125 0.467125
5 5.020789 0.020789
13 10.990762 -2.009238

This graph shows the linear fit between the predicted and actual values


This graph shows the difference between the predicted and actual values


The data set is far too small to declare anything. Where the actual was high there is less data and the error is greater. In order to get more data I’ll have to make it up. Having worked in development for (many) years, I know that people tend to follow a pattern when giving estimates. The type of task will also dictate estimates. A UI task may seem simple to someone familiar with UI development, while a server/backend person may find a UI task daunting. In deciding how to create sample data I devised a scheme that gives each team member a strength in three skills: UI, database, and server. Each member also has an estimation rating, which defines how they tend to rate tasks: low, high, mix or random. Once I get this working I will start over and see how this new data performs.
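One way the scheme could be sketched is below. Everything numeric here is an assumption made for illustration: the skill scale (0 to 1), the multipliers for low and high raters, and the way skill stretches the guess are placeholders, not the scheme I will necessarily end up with.

```python
import random

FIBONACCI = [1, 2, 3, 5, 8, 13]


def nearest_fib(value):
    """Snap a raw guess to the closest allowed estimate."""
    return min(FIBONACCI, key=lambda f: abs(f - value))


def estimate(actual, task_type, member, rng):
    """Simulate one member's estimate for a task (all factors are assumptions)."""
    skill = member["skills"][task_type]      # 0.0 (weak) .. 1.0 (strong)
    guess = actual * (1.5 - skill)           # weaker skill makes the task look bigger
    rating = member["rating"]
    if rating == "mix":                      # sometimes low, sometimes high
        rating = rng.choice(["low", "high"])
    if rating == "low":
        guess *= 0.7
    elif rating == "high":
        guess *= 1.4
    elif rating == "random":
        guess = rng.choice(FIBONACCI)
    return nearest_fib(guess)


rng = random.Random(42)
member = {"skills": {"ui": 0.5, "database": 0.9, "server": 0.2}, "rating": "low"}
print(estimate(8, "ui", member, rng))
```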


Until then…

Posted in data | Leave a comment

A year with data: Day 2

Wow! two days in a row, good for me.

Where to start… Probably with the data.

Project Tycho.
The project has gathered data (level 2) over a 126 year period (1888 to 2014). Divided into cases and deaths, it includes fifty diseases, fifty states and 1284 cities. Access is via a web service. There are calls to get a list of all diseases, states, cities, cases and deaths. Using Python, I pulled the various pieces and stored each in a file. Because the process takes a while, it was better to get the data once and then format it as needed. For each state/city I also obtained the lat/lon information. Finally I gathered all of the data into one file where each record looked like the ones below:

Event St City Disease Year Week Count Lat Lon
Case, AK, KETCHIKAN,MEASLES, 1914, 24, 1, 55.34222219999999, -131.6461111
Death,AK, KETCHIKAN,MEASLES, 1914, 24, 1, 55.34222219999999, -131.6461111
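Reading such a record back is straightforward as long as the fields themselves contain no commas. The sketch below parses one line into a named tuple; the Event type and its field names are my own labels for the columns shown in the header.

```python
from collections import namedtuple

# Field names follow the header line above; "number" is the Count column.
Event = namedtuple("Event", "event state city disease year week number lat lon")


def parse_record(line):
    """Split one comma-separated record and convert the numeric fields."""
    parts = [p.strip() for p in line.split(",")]
    event, state, city, disease = parts[:4]
    year, week, number = (int(p) for p in parts[4:7])
    lat, lon = float(parts[7]), float(parts[8])
    return Event(event, state, city, disease, year, week, number, lat, lon)


record = parse_record(
    "Case, AK, KETCHIKAN,MEASLES, 1914, 24, 1, 55.34222219999999, -131.6461111"
)
print(record.city, record.year, record.number)
```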

The process:
1. Get all diseases.
2. Get all States.
3. For each State get all cities.
4. For each State/City geocode the city.
5. For each Disease.
For each State and City get events.

Some python code:

The code below gives examples of how to pull the data from the Tycho site. The key is assigned by them. I found some cases where there are ‘[‘ and ‘]’ characters in the data. Since I couldn’t determine what to do with these I simply skip them. I also check for commas and spaces, which make parsing difficult.

import urllib.request
import xml.etree.ElementTree as et

def get_disease(key):
    listOfdisease = []
    url = ''+key
    response = urllib.request.urlopen(url)
    html = response.read()
    xml = et.fromstring(html.decode('utf8'))
    with open("data/disease" + ".data", "w") as myfile:
        for element in xml.findall("row"):
            disease = element.find("disease")
            # skip names containing characters we don't want: '[' ']' '/'
            if "[" not in disease.text and "]" not in disease.text and "/" not in disease.text:
                # replace spaces and drop commas so the name is easy to parse later
                name = disease.text.replace(" ", "_").replace(",", "")
                print(name)
                myfile.write(name + "\n")
                listOfdisease.append(name)

    return listOfdisease

Find the state from the string. Each state is defined by the tag ‘loc’

xml = et.fromstring(html.decode('utf8'))
for element in xml.findall("row"):
    StateAbv = element.find("state")
    State = element.find("loc")

Finding cases or deaths is a bit more complicated. The field ‘number’ represents the number of events for that period.

for element in xml.findall("row"):
    year = element.find("year")
    week = element.find("week")
    number = element.find("number")
    if int(number.text) > 0:
        case = Case(disease, year.text, week.text, number.text, state)

(Moving to GitHub soon).

The hardware. For the most part I just use a Windows laptop. I need to run Hadoop and Spark, and since I use the laptop for work I need a different solution, something that I can run without disruption. Hadoop is marketed as running on commodity hardware. Let's see. I have two old systems that I have installed Linux (Ubuntu) on. I also installed Hadoop and a host of other support stuff. I need to get these set up as a cluster at some point.


A year of data..

A year with data

I have been trying to work on data analysis for a while, but it's been a lot of start and stop. I started with pure spatial data (University of Maine, Spatial Information) and then started working with public health data. Eventually I came to understand that the two are connected. Considering where events occurred can be helpful in understanding how to handle public health issues. Some guy named Snow figured this out in the 1850s. The big data movement has made things like machine learning, R, NLP, Hadoop, Pandas and Spark popular. I have decided to spend the next year mucking about with a couple of data sets to get a better idea of what can and can’t be done.

The data.
There are two data sets I plan to use (so far).
The first is public health information assembled by Project Tycho at the University of Pittsburgh. The project has gathered public health data for over a hundred years. It consists of events (cases or deaths) due to disease. Each event is associated with a State, City, Year, and Day of the year. I have added Lat and Lon for each City.

The second set is being created each day(when I remember…). It involves pulling data from a job board using their api. This data is nice because it is changing every day. It also has a lot of free text that might be useful for NLP or classification.

My day job involves Java. For this effort I’d like to stay with Python. There are some exceptions where Java might make more sense, such as loading large data sets or Hadoop MapReduce. I am using Django to create web apps as needed.
Python has its own analysis libraries, but also works well with R. Probably a good path to stay with.

My favorite data store is Neo4j, the graph database.


Multi-channel Attribution Using Neo4j Graph Database

Neo4j Graph:Multi-channel Attribution

Business Applications


Globally more than $500 Billion was spent on advertising (Lunden, 2013). One of the greatest challenges of spending money on advertising is trying to understand the impact of those dollars on sales. With the proliferation of multiple mediums or channels (TV, search engines, social media, gaming platforms and mobile) on which precious marketing dollars can be spent, a Chief Marketing Officer (CMO) is in dire need of insights into the return on his investment in each medium. More importantly, the CMO needs timely data to prove that spending on a specific channel has a good return on investment. Neo4j can be used to help marketing applications get answers to tough questions:

  • How much was the increase in web awareness of the product after a commercial was aired in a specific TV channel on a specific date in a specific geographic area?
  • How much of that web awareness translated into…



Image processing to find tissue contours

The GRiTS project considered how genes interact with each other in space and time. The evaluation process begins by determining the border structure of the image. This is done manually using a drawing tool and outlining the border. The resulting border data points were captured and fed into a tool such as 3dMax to create a multidimensional shell. This gave the image tool a reference point for aligning gene data points in the volume.

I have been interested in processes that would make this more automated. The GRiTS viewer tools were developed in C++ and QT. ImageJ was also used to pre-process the images to remove some of the extraneous information within the image.


OpenCV is a C++ library designed for various image processing and machine vision algorithms. Here is a sample image that I am using.

The first thing is that some of the images have color and some are gray scale. The colors are used in some cases to indicate a specific piece of information. In this case I am looking for contours and for that a gray scale image works better.


The function cvtColor() will work to convert to gray scale. In the version of OpenCV I am using, the call takes the enum CV_BGR2GRAY as the option. In later versions I believe this has changed to COLOR_BGR2GRAY. The function takes a src and dest image along with the appropriate code.


There are a number of blur options available. I have started out with the basic normalized box filter blur. I have set the kernel size as 3×3 to start with.


Thresholding removes unwanted values. But “unwanted” depends on the image and what you are trying to eliminate. In this case I am looking for pixels that make up the boundary. I don’t want to be harsh in removing values since this causes large gaps that are hard to fill. For this test I have set the threshold value to 50 and the max value to 250. Of course these values will change depending on the image. I suspect that this will require applying some statistics and machine learning to create “best guess” starting values. After all, the goal is to make this as automatic as possible.

threshold(src, dest, threshold_value, max_BINARY_value, THRESH_TOZERO);
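To show what the to-zero threshold type does to individual pixels, here is a tiny sketch in plain Python on a nested list instead of an OpenCV image. As I understand the OpenCV documentation, THRESH_TOZERO keeps a pixel when it is above the threshold and sets it to 0 otherwise, and the max value only matters for the binary threshold types. The function name and sample pixels are made up.

```python
def threshold_to_zero(image, threshold_value):
    """Keep pixels above the threshold; set everything else to 0."""
    return [[p if p > threshold_value else 0 for p in row] for row in image]


# Made-up 2x3 gray scale image.
image = [[10, 50, 51],
         [200, 49, 255]]
result = threshold_to_zero(image, 50)
print(result)
```

A pixel exactly at the threshold (50 here) is dropped, so a slightly lower value keeps more of a faint boundary.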


After blur and threshold I applied an edge detection process. For this I used the Canny algorithm. The lowThresh and highThresh define the threshold levels for the hysteresis process. The edgeThresh is the aperture size for the Sobel operator.

int edgeThresh = 3;
double lowThresh = 20;
double highThresh = 40;
Canny(src, dest, lowThresh , highThresh , edgeThresh );

[Figure: OpenCV contour result]


The edge detection process creates a lot of segments. Contouring will try to connect some of the segments into longer pieces.

CV_RETR_EXTERNAL: find only outer contours
CV_CHAIN_APPROX_SIMPLE: compresses segments
Point(0, 0): offset for contour shift.
Contours are stored in the contours variable, a vector<vector<Point> >; hierarchy is a vector<Vec4i>.
findContours(edgeDest, contours, hierarchy, CV_RETR_EXTERNAL, CV_CHAIN_APPROX_SIMPLE, Point(0, 0));



I took the same processes used with OpenCV and implemented them with ImageJ. Using ImageJ is different than OpenCV. It is really designed so that the developer creates plugins that the ImageJ tool can use. I expected to use ImageJ as a library, as part of a bigger app.


This was used to create the above image. The raw image is on the left.

// rawimp was not declared in the original snippet; opening the same file is an assumption
ImagePlus rawimp = IJ.openImage("Dcc29.jpg");
ImageProcessor rawip = rawimp.getProcessor();
rawip = rawip.resize(rawip.getWidth()/2, rawip.getHeight()/2);
ImagePlus imp = IJ.openImage("Dcc29.jpg");
ImageConverter improc = new ImageConverter(imp);
ImageProcessor ip = imp.getProcessor();
ip = ip.resize(ip.getWidth()/2, ip.getHeight()/2);

Scale over time

Another issue is scale. The complete set of images represents development over a period of time. In the beginning the images are small. By the end of the series they are considerably larger. Mapping positions on the cell images as they grow is still a challenge. Landmarks change over time, coming and going, so they can’t be counted on. The images are obtained at intervals which are relatively close to each other. This means that points on one image will be close to points on other images in similar positions at similar time periods.

Consider an image in the middle (image #20) at day 15. Points on this image should be close to points on a middle image at day 18.

By interpolating between images it may be possible to track point movement over a period of time.
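A minimal sketch of that idea, assuming straight-line (linear) movement between two matched points, is below. The coordinates and the function name are made up; real tissue growth may well need something better than a straight line.

```python
def interpolate_point(p0, t0, p1, t1, t):
    """Linearly interpolate a point's position at time t between (p0, t0) and (p1, t1)."""
    frac = (t - t0) / (t1 - t0)
    return tuple(a + frac * (b - a) for a, b in zip(p0, p1))


# Made-up positions of the same landmark on day 15 and day 18.
day15 = (100.0, 40.0)
day18 = (130.0, 55.0)
midpoint = interpolate_point(day15, 15, day18, 18, 16.5)
print(midpoint)
```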


Internet of Related Things IORT

The topic of the Internet of Things (IOT) is quite popular today. There is a lot of discussion about IP enabled devices, APIs, storage, and other ideas that address the “How” of IOT. Having spent a good many years developing software, I have seen this process over and over. A customer wants A and immediately everyone starts to suggest ways to implement A. Rarely does anyone ask “why A?”, what is the problem being solved here?

As I see it, the idea behind IOT is that there are “things” in the world that we’d like to track, and in order to do so they need to be internet enabled. Of course some things have to be internet enabled, but not all things. I spent a number of years working with RFID in both access control and asset tracking. It is not simple to track things based simply on location; typically there needs to be some additional context. This helps the system or user make a more informed decision about an item’s real location. Location could be by RFID, Bluetooth, barcode, WiFi, or logged manually by a person (car is parked in the Donald Duck lot, aisle 134, row 654). Context lets the user or system make sense of the location. Is the item really in the paint shop or the radio shop? If the item is electronic, a radio, then it is not likely in the paint shop.

When designing a system it is worth considering how people perform a similar activity. In this case the context of knowing what the item is may not help. An item such as car keys could be anywhere. Knowing more about the item doesn’t really help. How do I find the car keys? One way is to wander around the house, look in the car, or check various coat pockets. More likely I will ask someone. That someone may not know but they may know someone who does. I might start by asking my daughter. If she doesn’t know she could suggest asking mom since they went to the store in my car recently. I’ll call my wife and she will tell me they are in the junk drawer on the counter in the kitchen.
Another scenario is tracking a package. If I inquire about a package I am expecting, the delivery service doesn’t know where the box is. They only know where it is in relation to other things, such as a truck, warehouse or plane. The service will likely know where the truck or driver currently is, or the most recent location. They can tell that my package is on a truck that is in Portland, Maine. It is 5:00pm and I am in Orono, Maine, 3 hrs away, and so I am not likely to see my package until tomorrow. The idea is that people manage things not by absolute location but through relationships. The package is related to the truck that has a known location. The car keys are in the junk drawer in the kitchen, so I know where to look.

The process can best be described with a graph. Finding a thing involves finding a path between nodes, the car keys and me. It's about how nodes (Things) are related to each other. Below is a diagram of how relations might be described.

Relations are defined as we naturally see them. A Family is made up of persons, not rooms or items.

Relationships are further defined using graph terms: vertices or nodes together with edges or links. I am going to use a graph database and therefore I have assigned properties to the edges. They are not needed for what I am trying to do right now, but they may be useful if there is ever a need to define more specific paths, or to ignore some paths.

Test the idea

To try this idea I am using the Neo4j graph database along with Python. The basic process is to create nodes and then connect them by defining edges (relationships).

One aspect of Neo4j that I am not settled on is the ability to define “relationships” as needed. It's useful to the developer, but there is no way for a user to know what relationships are used and how. Something like RDF triples maybe?

For this test there are three relationships. For the most part I use “CONTAINS”.

  1. A person “BELONGTO” a family
  2. A person “LIVESIN” a house
  3. A house “CONTAINS” a room.


[Figure: IoRT relationship diagram]


Create nodes and edges

Use Python and py2neo to create nodes and then the relationship between them.

from py2neo import Graph, Node, Relationship

# open a connection to the running service
graph = Graph()

# create a node for the Family
family = Node("Family", name="Family", title="Smith")
graph.create(family)

# create a node for the House
house = Node("House", name="House", title="Smith Family House")

# create a Location node
kitchen = Node("Location", name="Room", title="Kitchen")

# create an Item node
oven = Node("Item", name="Appliance", title="Oven")

# Finally, create a person node
dad = Node("Person", name="Person", title="Dad")

Relate the nodes

[Figures: IoRT relationship diagrams 1 and 2]

An example of creating the relationships.
graph.create(Relationship(dad, "BELONGTO", family))
graph.create(Relationship(house, "CONTAINS", kitchen))
graph.create(Relationship(kitchen, "CONTAINS", oven))

The graph


Where are my keys?

Finally it's time to find the car keys.
What I really want is the path from me (Dad in this example) to the keys. Neo4j can find the shortest path between two nodes fairly easily.

Using Cypher from the database browser I executed this MATCH query:

MATCH (person:Person {title:'Dad'}), (keys:Thing {name:'Thing', title:'Car Keys'}),
p = shortestPath((person)-[*]-(keys))
RETURN p

The result of the search indicates the path from “Dad” to the “Car Keys”.
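What the shortest path search does can be sketched without a database as a breadth-first search over an undirected edge list. The edges below mirror the nodes created earlier, with the "Junk Drawer" and "Car Keys" nodes added as assumptions taken from the story; none of this is how Neo4j implements shortestPath internally.

```python
from collections import deque

# Undirected relationships; "Junk Drawer" and "Car Keys" are assumed nodes.
edges = [
    ("Dad", "Smith Family"),
    ("Dad", "Smith Family House"),
    ("Smith Family House", "Kitchen"),
    ("Kitchen", "Oven"),
    ("Kitchen", "Junk Drawer"),
    ("Junk Drawer", "Car Keys"),
]


def shortest_path(edges, start, goal):
    """Breadth-first search; returns the list of nodes on a shortest path, or None."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in neighbours.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


print(" -> ".join(shortest_path(edges, "Dad", "Car Keys")))
```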

[Figure: result of the MATCH query for the keys]

How does the junk drawer know about the car keys? Maybe the Bluetooth or RFID fob on the car key? Or the junk drawer is a plain drawer with a “smart pad” inside? I am still focused on the What or Why; the “How” is about the technology, and that will work out as the needs become clearer.

One issue that will arise is when a “thing” is in two places at the same time. This wouldn’t likely happen in the package delivery scenario, but with Bluetooth devices in a home setting it might if two sensors “see” the same device. A device such as a smart pad should be designed (or configurable) to detect only at close range. The system could take advantage of historical data to predict the likelihood of a thing being in one of the two places.

To start with the “How” is to look for a problem to fit the solution. When people can understand what the idea of IOT can offer there will be a driving interest to figure out the “How”



Graph Database – project Tycho with RNeo4j

Before I continue with R, I discovered another issue. When I started to process the data sets I obtained the Lat, Lon for each of the cities. I have used the spatial component of Neo4j before in a Java application. I was disappointed to discover that py2neo doesn’t support this yet. The Neo4j Python REST Client does have support. Since spatial indexes can be added to existing data, it's not a big deal.

Having read Nicole White’s writings on Neo4j and R, I was excited to see how R works with Neo4j.

R and Neo4j
Having used R for spatial analysis I was curious to see what could be done with R and the Tycho data. I have yet to add spatial indexes but there are still a lot of other operations that don’t need spatial.

I am using R-Studio. RNeo4j can be found at The instructions for installation  work well.  Following the examples given with the install I was able to run a simple query and obtain a data frame.

Load RNeo4j and create a connection:
library(RNeo4j)
graph = startGraph("http://localhost:7474/db/data/")
This just tells me that the connection is good.

Define a query. Here I am using one that just returns the list of states.
The first time I tried to run the query just as it was in Python.
query = "MATCH (n:`State`) RETURN n"
I received this strange error:

Error in, row.names = rlabs) :
supplied 54 row names for 1 rows

Nicole was quick to point out that to use Cypher to return nodes you should use getNodes(). For relations use getRels().
The new query: "MATCH (n:State) RETURN n.description"

Now running the query there is no error and I get a list of states.
df = cypher(graph, query)
Display the results

1 CA
2 GA
3 CT
4 AK
5 AL
6 CO
7 DC
and so on

The query "MATCH (n:Case) RETURN n.year" returns too much data to print out. Instead I filtered the data to a date range, n.year > 1900 and n.year < 1950:
query = "MATCH (n:Case) WHERE n.year > 1900 AND n.year < 1950 RETURN n.year, count(*) AS count"
Looking at rows of data is okay but what I really wanted was something visual.
Again, Nicole’s site has examples of plotting. I started with plot().
This took a bit of experimenting to get it right.

plot(df,xlab="year",ylab="count",main="CASES counts, all states, between 1900 and 1950" )

[Figure: case counts, all states, between 1900 and 1950]

This is more like it. What else is there?
ggplot() ?

The command below took a while to get correct. The aes() function wasn’t clear at first, but an explanation I found helped. There are enough examples already that explain ggplot that I am not going into details.

ggplot(df, aes(x = n.year, y =count))+geom_bar(stat = "identity",fill = "darkblue") + coord_flip() +labs(x = "Year", y = "Count", title = "Count of MEASLES over the entire range") +theme(axis.text = element_text(size = 12, color = "black"), axis.title = element_text(size = 14, color = "black"), plot.title = element_text(size = 16, color = "black"))

[Figure: ggplot bar chart of cases between 1900 and 1950]

With my confidence up, time to try something more interesting, more R like.
The query below returns all cases of the disease Measles. I want to try and fit a linear model to the results.

query = "match (d:Disease)<-[:CASE_OF]-(dt:Case)--(ct:City)<-[:CITY_IN_STATE]->(st:State) where d.description ='MEASLES' return dt.year,count(*) as count"

measles all years, no line

Fitting a linear model to the data.
model = lm(formula=count ~ dt.year, data=df)
And then draw the “Line of Best Fit”
measles all states all years
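
For anyone following along in Python rather than R, here is a minimal sketch of what a simple least-squares fit of count against year computes. The function name fit_line and the sample numbers are made up for illustration; they are not the Tycho data and this is not how lm() is implemented internally.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two or more points and equal-length inputs")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations in x, and the x/y cross deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up (year, count) pairs standing in for the query result.
years = [1900, 1901, 1902, 1903]
counts = [10, 12, 14, 16]
intercept, slope = fit_line(years, counts)
print(intercept, slope)
```

The slope is the average change in count per year, which is what the line of best fit shows on the plot.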

By adding a state qualifier to the query I could view measles counts for a given state.
MEASLES Maine all years
MEASLES Texas all years
MEASLES Florida all years
Measles CA all years

It is interesting to see the differences between the states. Most of the queries showed a peak around the 1920’s and then a decline. I have found some information that indicates a peak or elevated number of cases reported in the 1920’s. The data would match that. What seems to be up for discussion is why. I am not an epidemiologist so I won’t go any further.


Graph Database – project Tycho part 5

The loading process was taking too long. I felt there were several options. One was to switch to Java and load the data with an embedded connection. Another was to try the batch loader. I looked at this and it looked difficult considering how my data was arranged. Or maybe I am not that smart? The final option was to use py2neo and its batch mode. I really want to stay with Python for now, so I chose the latter.

The batch process is a bit different and it took some time to rework the code. I have the data broken down by states. I tested loading a complete state in one batch as well as breaking each state data into smaller batches.

This code is run once at the beginning.

 # py2neo 1.x style imports
 from py2neo import neo4j, node, rel
 gdb = neo4j.GraphDatabaseService("http://localhost:7474/db/data/")
 batch = neo4j.WriteBatch(gdb)
Create an index:
 city_idx = gdb.get_or_create_index(neo4j.Node, 'Cities')

For each city in each state, create a city node. The State nodes are created in the same fashion. After both nodes are created, a relation is created between the two.

 # queue the city node and its description in the batch
 cityNode = batch.create(node(name=cityName))
 batch.set_properties(cityNode, {'description': cityName})
 # relate the state and the city node
 batch.create(rel(stateNode, "CITY_IN_STATE", cityNode))

There wasn’t any significant difference between one batch per state and smaller batches. Either way the load time was better than before: the old process would run all night, while the batch process completes in 4-5 hours. I didn’t expect to load the data over and over, so this was okay.
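
Breaking a state's rows into smaller batches can be sketched roughly like this. The helper name chunked and the sample rows are my own placeholders; in the real loader each chunk would be queued on the write batch and submitted before moving on to the next one.

```python
from itertools import islice


def chunked(rows, size):
    """Yield successive lists of at most `size` items from `rows`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


# Hypothetical stand-in for the rows of one state.
state_rows = ["row-%d" % i for i in range(7)]
for chunk in chunked(state_rows, 3):
    # In the loader, this is where one batch would be built and submitted.
    print(len(chunk), chunk)
```

Loading a complete state in one batch is then just the case where size is at least the number of rows.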

There seemed to be an issue…

With the complete data set loaded, the size of the database was 1.2G. Maybe that was okay? The next step was to try a few queries. One of the first things I tried was to count the number of nodes of a given type. The data I pulled from the site only had eight different diseases, so I was expecting to see a count of eight. I was surprised to find a lot more (I forget how many, just a lot more than eight). It was clear I was doing something wrong, and the extra nodes needed to be removed.

The Data.
In the files, each row contains either a death or a reported case event. It also contains the disease. What I discovered was that for each event I was creating a new disease node. Not what I wanted! What I had intended was one node for each of the diseases, with the events referencing them, the same approach I had used for states and cities.

Code re-work
There was already a file containing the disease information, so it was simple to process it and create the required nodes. As each event was processed I was going to need to associate the appropriate disease node. Given the number of records, the last thing I wanted was to query the database each time. After all, the batch process was supposed to speed things up, and a lookup per event was likely to make the process much slower. The solution was to cache the disease nodes in a Python dictionary.

At the start of the load process all eight of the disease nodes were created and each node was added to the map, with the disease name as the key.

 simplelist = {}

Once all of the diseases are loaded they are retrieved and added to the map. The ‘description’ is the disease name and is used as the key. The node is stored as the value.

      query = neo4j.CypherQuery(gdb, "MATCH (n:`Disease`) RETURN n")
      records = query.execute()
      for n in records:
        node = n[0]
        print(node["description"])
        # cache the node under its disease name
        simplelist[node["description"]] = node


When it came time to load the events all that was required was to parse out the disease name and locate the node in the map.
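
The lookup step can be sketched as below. Plain dictionaries stand in for the py2neo nodes, and the event field name "disease" plus the strip/upper normalisation are assumptions for illustration, not the layout of the real files.

```python
def build_cache(nodes):
    """Map each node's 'description' (the disease name) to the node itself."""
    return {n["description"]: n for n in nodes}


def disease_for_event(row, cache):
    """Return the cached disease node for one event row.

    Raises KeyError if the disease was never loaded, which is better than
    silently creating a duplicate node.
    """
    name = row["disease"].strip().upper()
    return cache[name]


# Stand-ins for the disease nodes created at the start of the load.
nodes = [{"description": "MEASLES"}, {"description": "POLIO"}]
cache = build_cache(nodes)

# Hypothetical parsed event row.
event = {"disease": " measles ", "year": 1920, "week": 10}
print(disease_for_event(event, cache)["description"])
```

Every event for the same disease gets the same node object back, so no query is sent to the database during the event load.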

The process still takes 2-3 hours to complete but the size is down to about 800M. Running a variety of queries turned up an issue I hadn’t counted on: some states had cities that didn’t belong. Was it the raw data or a bug in my load process? I looked at the raw data files and found the problem was there. I deleted the data and pulled it from the site again. This time I didn’t see the problem. After another reload of the database the queries looked better.

The total node count is about 700k but most queries run pretty quick.

Make the database public
As mentioned in an earlier section I loaded the database onto an Amazon EC2 micro instance. I wanted something that was free for the immediate future. I don’t have the extra money to spend just for the heck of it.
The first time I loaded the data I had selected only about a third of the information. This made it easier to upload. With the corrected structure and all of the data, I wanted to push the entire set up. All seemed fine until I tried anything beyond a very simple query. Using the server’s browser viewer I would routinely get messages about being disconnected. But the queries ran fine on my laptop, so I wasn’t sure if this was an installation/configuration issue on EC2. I had the idea to look at the EC2 console and see if that would tell me anything. It did: the CPU was showing 100%. Apparently the micro instance has less power than the i7 on my laptop? You get what you pay for, I suppose.

I went back and created a new database, again with only a third of the data (but with the new structure). It works on EC2 but it’s not what I really wanted.

Below are some queries from the data set.

 match (d:Disease)<-[:CASE_OF]-(dt:Case)--(ct:City)<-[:CITY_IN_STATE]->(st:State) where st.description ="CA" and dt.year=1920 and dt.week=10 return st,ct,dt,d



match (d:Disease)<-[:DEATH_FROM]-(dt:Death)--(ct:City)<-[:CITY_IN_STATE]->(st:State) where st.description ="CA" and dt.year=1920 and dt.week=10
return st,ct,dt,d


match (d:Disease)<-[:DEATH_FROM]-(dt:Death)--(ct:City)<-[:CITY_IN_STATE]->(st:State) where st.description ="ME" and dt.year=1920 and dt.week=10
return st,ct,dt,d

match (d:Disease)<-[:CASE_OF]-(dt:Case)--(ct:City)<-[:CITY_IN_STATE]->(st:State) where st.description ="ME" and dt.year=1920 and dt.week=10
return st,ct,dt,d
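
The four queries above differ only in the event type and the state, so they could be generated from one template. This is a sketch; the names EVENTS, TEMPLATE and event_query are made up here, and plain string formatting is used only because the inputs are checked first.

```python
# Relationship type and node label for each kind of event in the graph.
EVENTS = {
    "case": ("CASE_OF", "Case"),
    "death": ("DEATH_FROM", "Death"),
}

TEMPLATE = (
    "match (d:Disease)<-[:{rel}]-(dt:{label})--(ct:City)"
    "<-[:CITY_IN_STATE]->(st:State) "
    'where st.description ="{state}" and dt.year={year} and dt.week={week} '
    "return st,ct,dt,d"
)


def event_query(kind, state, year, week):
    """Build the Cypher text for one state, year and week."""
    rel, label = EVENTS[kind]  # KeyError for an unknown kind
    if not (len(state) == 2 and state.isalpha()):
        raise ValueError("state must be a two-letter code")
    return TEMPLATE.format(rel=rel, label=label, state=state.upper(),
                           year=int(year), week=int(week))


print(event_query("case", "CA", 1920, 10))
print(event_query("death", "ME", 1920, 10))
```

Restricting the state to a two-letter code and forcing year and week to integers keeps arbitrary text out of the query string.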



Graph Database – project Tycho part 4

The process of loading the database is slow. I’ll need to determine how to use the batch loader before I can utilize all of the data.

Looking at the data on my laptop was interesting, but sharing it meant doing more. I decided to go the Amazon EC2 route.

Update: The EC2 micro instance is not capable of handling even a third of the data. I have pulled the instance and will look for another hosting solution.

The next step is to create cypher queries. These two give an idea of what can be found.

match (st:State)-[:CITY_IN_STATE]->(ct:City)
where =”AL”
return ct;
match (st:State)-[:CITY_IN_STATE]-(ct:City)-[HAS_DEATH]-(d:Death)
where =”CA” and d.year=1920
return ct,d

Since this is still in a testing phase I have limited the data set to 50 cases and deaths for each city. Also need to geocode the cities so interesting spatial queries can be done.
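
One way to cap the test data at a fixed number of events per city is sketched below. I am assuming the cap applies separately to cases and to deaths, and the field names "city" and "kind" are placeholders for whatever the parsed rows actually contain.

```python
from collections import defaultdict


def limit_per_city(events, limit=50):
    """Yield at most `limit` events per (city, kind) pair, in input order."""
    seen = defaultdict(int)
    for event in events:
        key = (event["city"], event["kind"])
        if seen[key] < limit:
            seen[key] += 1
            yield event


# Made-up rows: three cases and one death for the same city.
rows = [
    {"city": "BOSTON", "kind": "case", "week": 1},
    {"city": "BOSTON", "kind": "case", "week": 2},
    {"city": "BOSTON", "kind": "case", "week": 3},
    {"city": "BOSTON", "kind": "death", "week": 1},
]
kept = list(limit_per_city(rows, limit=2))
print(len(kept))
```

Because it is a generator, the filter can sit in front of the loader without reading a whole file into memory.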

Next, use Xamarin to develop an app to view and query the data.

Or… node.js with D3.js  ?

