Graph Database – project Tycho part 3

With the data saved in one file per state, it is time to create and fill the database.

Again I went with Python, since it makes connecting to Neo4j over REST simple. The first step was to determine the nodes and relationships. As far as I can tell there is no established method for this. With a relational database one would start with tables and then add foreign keys. With Neo4j I decided to create nodes based on the key items in the data:

State, City, Disease, Case, Death

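Relationships are the glue that connects nodes. Unlike foreign keys, they carry a descriptive name. Consider State and City. The relationship I first considered was “HAS_A”: a State “HAS_A” City. I chose “CITY_IN_STATE” instead, which reads naturally as City “CITY_IN_STATE” State, although the relationship itself is created from the State node to the City node, as in the list below.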



State -> CITY_IN_STATE -> City
Case -> CASE_OF -> Disease
City -> HAS_CASE -> Case
Death -> DEATH_FROM -> Disease
City -> HAS_DEATH -> Death
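
To make the direction of each relationship concrete, here is a minimal sketch of the same model in plain Python. It does not talk to Neo4j; `TinyGraph` and the sample names are hypothetical stand-ins, and the only thing taken from the list above is which node kind sits at each end of each relationship.

```python
from collections import defaultdict

# Relationship types and the node kinds they connect, as listed above.
SCHEMA = {
    "CITY_IN_STATE": ("State", "City"),
    "CASE_OF": ("Case", "Disease"),
    "HAS_CASE": ("City", "Case"),
    "DEATH_FROM": ("Death", "Disease"),
    "HAS_DEATH": ("City", "Death"),
}


class TinyGraph:
    """Hypothetical in-memory stand-in for a graph store (not Neo4j)."""

    def __init__(self):
        self.kinds = {}                 # node name -> node kind
        self.edges = defaultdict(list)  # (start name, rel type) -> [end names]

    def add_node(self, kind, name):
        self.kinds[name] = kind
        return name

    def relate(self, start, rel_type, end):
        # Reject relationships that do not match the schema's direction.
        expected = SCHEMA[rel_type]
        actual = (self.kinds[start], self.kinds[end])
        if actual != expected:
            raise ValueError(f"{rel_type} expects {expected}, got {actual}")
        self.edges[(start, rel_type)].append(end)

    def follow(self, start, rel_type):
        # Return the nodes reached from `start` over one relationship type.
        return list(self.edges.get((start, rel_type), []))


graph = TinyGraph()
graph.add_node("State", "Pennsylvania")
graph.add_node("City", "Pittsburgh")
graph.add_node("Disease", "Measles")
graph.add_node("Case", "case-1")  # made-up identifier, not real data
graph.relate("Pennsylvania", "CITY_IN_STATE", "Pittsburgh")
graph.relate("Pittsburgh", "HAS_CASE", "case-1")
graph.relate("case-1", "CASE_OF", "Measles")
```

Calling `graph.follow("Pennsylvania", "CITY_IN_STATE")` walks from the state to its cities, which is the same direction the relationships are created in.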

Using Python to create the database.

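from neo4jrestclient.client import GraphDatabase

# Connect to the local Neo4j REST endpoint.
gdb = GraphDatabase("http://localhost:7474/db/data")
# Indexes for looking nodes up by name.
state_idx = gdb.node.indexes.create('states')
city_idx = gdb.node.indexes.create('cities')
disease_idx = gdb.node.indexes.create('diseases')
# Labels for the main node types.
stateLabel = gdb.labels.create("State")
cityLabel = gdb.labels.create("City")
diseaseLabel = gdb.labels.create("Disease")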

I then created functions to create nodes driven by the data files.

def create_city(name):
    cityNode = gdb.node(name=name, description=name)
    city_idx['name'][name] = cityNode
    return cityNode

There are similar functions for states, cases, diseases, and deaths.
With a city node and a state node created, the relationship between them is added:

   cityNode = create_city(cityName)
   stateNode.relationships.create("CITY_IN_STATE", cityNode)
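
Since the same city presumably shows up on many rows of a state file, the load needs some way to reuse a node instead of creating it again. Below is a minimal sketch of that get-or-create idea. It uses a plain dict where the real load would use an index lookup, and `get_or_create` and `fake_create_city` are hypothetical helpers, not part of the code above.

```python
def get_or_create(cache, name, make_node):
    """Return the node stored under `name`, creating it on first use."""
    if name not in cache:
        cache[name] = make_node(name)
    return cache[name]


created = []  # records every creation so the reuse is visible


def fake_create_city(name):
    # Stand-in for create_city(); returns a dict instead of a Neo4j node.
    created.append(name)
    return {"name": name, "description": name}


city_cache = {}
first = get_or_create(city_cache, "Pittsburgh", fake_create_city)
second = get_or_create(city_cache, "Pittsburgh", fake_create_city)
```

Both calls return the same object and `created` holds the city only once, so a relationship such as CITY_IN_STATE would be added a single time per city.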

Below is how cases are created and the relationships built.

caseNode = create_case_event(caseName, "CASE", year, week, number, stateName)
caseNode.relationships.create("CASE_OF", diseaseNode)
cityNode.relationships.create("HAS_CASE", caseNode)

The results


Graph Database – project Tycho part 2

Retrieve the data.

Project Tycho supplies a REST interface for querying the data; an API key is required for access. One of the first queries to run returns the list of diseases in the data set. The result looks something like:


For now I have restricted the diseases to:


I am limiting the data until I am confident the model is correct and have figured out a faster way to import it.

I decided to get the data I wanted and store it by state in simple CSV files. This way I can experiment with the data without having to go back to the website each time.
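
As a rough illustration of this kind of local cache, here is a sketch that writes one CSV file per state and reads it back with Python's `csv` module. The column layout in `FIELDS`, the helper names, and the sample row are all assumptions made for the example; they are not the actual file format or Tycho data.

```python
import csv
import os
import tempfile

FIELDS = ["event", "disease", "city", "year", "week", "number"]  # assumed layout


def save_state(directory, state_name, rows):
    """Write one CSV file per state; skip states that already have a file."""
    path = os.path.join(directory, state_name + ".data")
    if os.path.isfile(path):
        return False
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(FIELDS)
        writer.writerows(rows)
    return True


def load_state(directory, state_name):
    """Read a state's file back as a list of dicts keyed by FIELDS."""
    path = os.path.join(directory, state_name + ".data")
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))


# Made-up row purely for illustration; not taken from the Tycho data.
sample = [["Case", "Measles", "Pittsburgh", 1930, 12, 4]]
workdir = tempfile.mkdtemp()
wrote_first = save_state(workdir, "Pennsylvania", sample)
wrote_again = save_state(workdir, "Pennsylvania", sample)
loaded = load_state(workdir, "Pennsylvania")
```

The second `save_state` call returns `False` because the file already exists, so a rerun does not need to fetch that state again. Note that `csv.DictReader` returns every value as a string, so numeric fields need converting after loading.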

Using Python, it's easy to construct a process to query the data and save the files. The first step is to get the disease list. Next, get a list of the states. Then loop through each state and get its list of cities. For each city, get the events (cases or deaths) for each disease.

This is my first attempt with Python and the code is pretty rough. I created classes (State, City, Case, Death) to hold information such as year, week, state and city. Each file is named after its state, with a .data extension. Because I have had to rerun the process, I only pull data for states that do not yet have a file.

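import os

listOfdisease = get_disease()
listOfStates = getStates()
for stateName in listOfStates:
    # Only pull data for states that do not have a file yet.
    if not os.path.isfile(stateName + ".data"):
        with open(stateName + ".data", "w") as myfile:
            listOfCities = getCities(stateName)
            for cityName in listOfCities:
                for disease in listOfdisease:
                    listOfCases = getCases(stateName, cityName, disease)
                    listOfDeaths = getDeaths(stateName, cityName, disease)
                    for case in listOfCases:
                        pass  # loop body is missing from this listing
                    for death in listOfDeaths:
                        # The listing is cut off here; the remaining fields are not shown.
                        # The original used an undefined `d`; assumed to be the disease.
                        myfile.write("Death," + disease + "," + death.year + ",")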
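
The classes that hold each event are not shown, so the following is only a sketch of how Case and Death records could be modelled with `dataclasses`. The field names, their order, and the `to_row` helper are assumptions for illustration, and the sample values are made up.

```python
from dataclasses import dataclass, astuple


@dataclass
class Event:
    """Fields shared by cases and deaths (names are assumptions)."""
    state: str
    city: str
    disease: str
    year: int
    week: int
    number: int

    kind = "Event"  # no annotation, so this is not a dataclass field

    def to_row(self):
        # Prefix the comma-separated field values with the event kind.
        return ",".join([self.kind] + [str(value) for value in astuple(self)])


class Case(Event):
    kind = "Case"


class Death(Event):
    kind = "Death"


# Made-up values for illustration only.
death = Death("Pennsylvania", "Pittsburgh", "Measles", 1930, 12, 1)
row = death.to_row()
```

Keeping the shared fields on one base class means the write loop can call `to_row()` on either kind of event without building the string by hand.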

Load the database.


Graph Database – project Tycho

In the fall of 2013 the University of Pittsburgh published a great store of public health data. The data includes cases (and deaths) of reported diseases for the United States as far back as 1888.

  • Events: cases and deaths
  • Diseases: 47
  • Locations: 50 states, and 1287 cities
  • Covered Years: 1888 to 2013

I am interested in how public health information can be used to better manage outbreaks and vaccinations. This seemed like a great resource to dive into.

The website offers the ability to query various aspects of the data set, such as cases by disease and state, or deaths from a disease by state or city. I wanted to look at the data from a spatial-relationship point of view, and for that I needed the data in a different format.

The graph database Neo4j immediately came to mind. Graphs are about relationships, and this data fits that very well.


State “has” city

City “has” Event

Event “is” caseOf or deathFrom

Both caseOf and deathFrom carry specific information: the year, the week, and the number of events.

The first thing is to retrieve the data and create a graph database. Next, geocode each city, and then develop queries to see what one can learn.

Retrieve the data
