Graph Database – project Tycho part 5

The loading process was taking too long. I felt there were several options. One idea was to switch to Java and load the data with an embedded connection. Another was to try the batch loader. I looked at this and it seemed difficult considering how my data was arranged. Or maybe I am not that smart? The final option was py2neo and its batch mode. I really wanted to stay with Python for now, so I chose the latter.

The batch process is a bit different and it took some time to rework the code. I have the data broken down by state. I tested loading a complete state in one batch as well as breaking each state's data into smaller batches.
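Splitting a state's rows into smaller batches is just slicing the row list before each submit. The slicing itself is plain Python; a minimal sketch (the `chunks` helper and the batch size of 4 are illustrative, not taken from my loader):

```python
def chunks(rows, size):
    """Yield successive slices of at most `size` rows."""
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

# Example: 10 rows in batches of 4 come out as sizes 4, 4, 2.
rows = list(range(10))
print([len(batch) for batch in chunks(rows, 4)])  # [4, 4, 2]
```

Each slice would then be loaded with its own WriteBatch and submitted before moving on to the next.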

This code is run once at the beginning.

 from py2neo import neo4j, node, rel
 gdb = neo4j.GraphDatabaseService("http://localhost:7474/db/data/")
 batch = neo4j.WriteBatch(gdb)
Create an index:
  city_idx = gdb.get_or_create_index(neo4j.Node, 'Cities')

For each city in each state, a city node is created. The State nodes are created in the same fashion. After both nodes exist, a relationship is created between the two.

 cityNode = batch.create(node(name=cityName))
 batch.set_properties(cityNode, {'description': cityName})
 # relate the state and the city:
 batch.create(rel(stateNode, "CITY_IN_STATE", cityNode))

There wasn’t any significant difference between loading a whole state at once and using smaller batches, but the overall time was better. The old process would run all night; the batch process would complete in 4-5 hours. I didn’t expect to load the data over and over, so this was okay.

There seemed to be an issue…

With the complete data set loaded, the size of the database was 1.2G; maybe that was okay? The next step was to try a few queries. One of the first things I tried was to count the number of nodes of a given type. The data I pulled from the site only had eight different diseases, so I was expecting to see a count of eight. I was surprised to find a lot more (I forget how many, just a lot more than eight). It was clear I was doing something wrong, and the extra nodes needed to be removed.
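Counting by label is a one-line query; something along these lines (assuming the `Disease` label used during the load):

 match (d:Disease) return count(d)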

The Data
In the files, each row contains either a death or a reported-case event, along with the disease. What I discovered was that for each event I was creating a new disease node. Not what I wanted! What I had intended was one node for each disease, with the events referencing it: the same process that had been used for states and cities.

Code re-work
There was already a file containing the disease information, so it was simple to process it and create the required nodes. As each event was processed I was going to need to associate the appropriate disease node. Knowing how many records there are, the last thing I wanted was to query the database each time; after all, the batch process was supposed to speed things up, and this change was likely to make the process much slower. The solution was to cache the disease nodes in a Python dictionary.

At the start of the load process all eight of the disease nodes were created and each node was added to the map, with the disease name as the key.

 simplelist = {}  # maps disease name -> node

Once all of the diseases are loaded they are retrieved and added to the map. The ‘description’ property holds the disease name, which becomes the key; the node itself is the value.

      query = neo4j.CypherQuery(gdb, "MATCH (n:`Disease`) RETURN n")
      records = query.execute()
      for record in records:
          node = record[0]
          # key the cache by the disease name
          simplelist[node["description"]] = node


When it came time to load the events, all that was required was to parse out the disease name and look the node up in the map.
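The pattern is ordinary dictionary caching. A self-contained sketch, with plain dicts standing in for py2neo nodes and illustrative disease names (none of these identifiers come from my loader):

```python
# Build the cache once, keyed by disease name. In the real loader the
# values are py2neo nodes; plain dicts stand in for them here.
disease_cache = {}
for name in ("SMALLPOX", "MEASLES", "INFLUENZA"):
    disease_cache[name] = {"description": name}

def disease_node_for(disease_name):
    """Look the node up in memory instead of querying the database."""
    return disease_cache[disease_name]

print(disease_node_for("MEASLES")["description"])  # MEASLES
```

Each event row then costs one in-memory lookup rather than one round trip to Neo4j.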

The process still takes 2-3 hours to complete, but the size is down to about 800M. Running a variety of queries turned up an issue I hadn’t counted on: some states had cities that didn’t belong. Was it the raw data or a bug in my load process? I looked at the raw data files and found the problem was there. I deleted the data and pulled it from the site again; this time I didn’t see the problem. After another reload of the database the queries looked better.

The total node count is about 700k, but most queries run pretty quickly.

Make the database public
As mentioned in an earlier section, I loaded the database onto an Amazon EC2 micro instance. I wanted something that was free for the immediate future; I don’t have the extra money to spend just for the heck of it.
The first time I loaded the data I had selected only about a third of the information, which made it easier to upload. With the corrected structure and all of the data, I wanted to push the entire set up. All seemed fine until I tried anything beyond a very simple query. Using the server’s browser viewer I would routinely get messages about being disconnected. But the queries ran fine on my laptop, so I wasn’t sure if this was an installation/configuration issue on EC2. I had the idea to look at the EC2 console and see if that would tell me anything. It did: the CPU was showing 100%. Apparently the micro instance has less power than the i7 in my laptop? You get what you pay for, I suppose.

I went back and created a new database, again with only a third of the data (but with the new structure). It works on EC2, but it’s not what I really wanted.

Below are some queries from the data set.

 match (d:Disease)<-[:CASE_OF]-(dt:Case)--(ct:City)<-[:CITY_IN_STATE]-(st:State) where st.description ="CA" and dt.year=1920 and dt.week=10 return st,ct,dt,d



match (d:Disease)<-[:DEATH_FROM]-(dt:Death)--(ct:City)<-[:CITY_IN_STATE]-(st:State) where st.description ="CA" and dt.year=1920 and dt.week=10
return st,ct,dt,d


match (d:Disease)<-[:DEATH_FROM]-(dt:Death)--(ct:City)<-[:CITY_IN_STATE]-(st:State) where st.description ="ME" and dt.year=1920 and dt.week=10
return st,ct,dt,d

match (d:Disease)<-[:CASE_OF]-(dt:Case)--(ct:City)<-[:CITY_IN_STATE]-(st:State) where st.description ="ME" and dt.year=1920 and dt.week=10
return st,ct,dt,d

