A year with data: Day 2

Wow! two days in a row, good for me.

Where to start… Probably with the data.

Project tycho.
The project has gathered data(level 2) over a 126 year period(1888 to 2014). Divided in to cases and deaths it include fifty diseases, fifty states and 1284 cities. Access is via a web service. There are calls to get a list of all diseases, states, cities, cases and deaths. Using Python, I pulled the various pieces and stored each is a file. The process takes a while it was better to get the data once and then format it as needed. For each state/city I also obtained the lat/lon information. Finally I gathered all of the data into one file where each record looked like the ones below:

Event St City Disease Year Week Count Lat Lon
Case, AK, KETCHIKAN,MEASLES, 1914, 24, 1, 55.34222219999999, -131.6461111
Death,AK, KETCHIKAN,MEASLES, 1914, 24, 1, 55.34222219999999, -131.6461111

The process:
1. Get all diseases.
2. Get all States.
3. For each State get all cities.
4. For each State/City geocode the city.
5. For each Disease.
For each State and City get events.

Some python code:

The code below gives examples of how to pull the data from the Tycho site. The key is assigned by them. I found some cases where there are ‘[‘ and ‘]’ characters is the data. Since I couldn’t determine what to do with this I simply skip it. I also check for commas and spaces which make parsing difficult.

def get_disease(key):
    listOfdisease = []
    url = 'http://www.tycho.pitt.edu/api/diseases?apikey='+key 
    response = urllib.request.urlopen(url)
    html = response.read()
    xml = et.fromstring(html.decode('utf8'))
    myfile = open("data/disease"+".data","w")
    for element in xml.findall("row") :
        type = element.find("disease")
        # remove characters we dont want '[' ']' '/'
        if  not  "[" in  type.text and  not  "]"  in  
        type.text and not  "/" in type.text:
            type.text = type.text.replace(" ","_")
            type.text =type.text.replace(",","")
            print (type.text )

    return listOfdisease

Find the state from the string. Each state is defined by the tag ‘loc’

   xml = et.fromstring(html.decode('utf8'))
    for element in xml.findall("row") :
        StateAbv = element.find("state")

        State = element.find("loc")

Finding cases or deaths is bit more complicated. The field ‘number’ represents the number of events for that period.

 for element in xml.findall("row") :
        year = element.find("year")
        week = element.find("week")
        number = element.find("number")
        if int(number.text)  > 0:
            case = Case(disease,year.text,week.text,number.text,state)  

(Moving to GitHub soon).

The hardware.  For the most part I just use a Windows laptop. I need to run Hadoop and Spark and since I use the laptop for work I need a different solution.Something that I can run without disruption . Hadoop is marketed as running on commodity hardware. Lets see. I have two old systems that I have installed Linux(Ubuntu) on. Also I installed Hadoop and a host of other support stuff. I need to get these set up as a cluster at some point.

About gricker

Living and learning
This entry was posted in Uncategorized. Bookmark the permalink.