Wow! Two days in a row, good for me.
Where to start… Probably with the data.
The project has gathered data (level 2) over a 126-year period (1888 to 2014). Divided into cases and deaths, it includes fifty diseases, fifty states and 1284 cities. Access is via a web service. There are calls to get a list of all diseases, states, cities, cases and deaths. Using Python, I pulled the various pieces and stored each in a file. The process takes a while, so it was better to get the data once and then format it as needed. For each state/city I also obtained the lat/lon information. Finally I gathered all of the data into one file where each record looked like the ones below:
Event St City Disease Year Week Count Lat Lon
Case, AK, KETCHIKAN,MEASLES, 1914, 24, 1, 55.34222219999999, -131.6461111
Death,AK, KETCHIKAN,MEASLES, 1914, 24, 1, 55.34222219999999, -131.6461111
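Reading those records back is straightforward. Here is a minimal sketch, assuming every row has exactly the nine comma-separated fields shown above and that the uneven spacing can simply be trimmed. The `Record` type and `parse_record` function are names I made up for illustration; they are not part of the original scripts.

```python
from typing import NamedTuple


class Record(NamedTuple):
    """One row of the combined file (hypothetical helper type)."""
    event: str
    state: str
    city: str
    disease: str
    year: int
    week: int
    count: int
    lat: float
    lon: float


def parse_record(line: str) -> Record:
    # Split on commas and trim the uneven spacing seen in the sample rows.
    fields = [f.strip() for f in line.split(",")]
    if len(fields) != 9:
        raise ValueError(f"expected 9 fields, got {len(fields)}: {line!r}")
    event, state, city, disease, year, week, count, lat, lon = fields
    return Record(event, state, city, disease,
                  int(year), int(week), int(count),
                  float(lat), float(lon))


rec = parse_record(
    "Case, AK, KETCHIKAN,MEASLES, 1914, 24, 1, 55.34222219999999, -131.6461111"
)
print(rec.city, rec.year, rec.count)
```

A plain `split(",")` only works because the disease and city names were already stripped of commas before being written out.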
The steps to collect the data were:
1. Get all diseases.
2. Get all states.
3. For each state, get all cities.
4. For each state/city, geocode the city.
5. For each disease, get the events for each state and city.
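The steps above can be sketched as one driver loop. This is only an outline under my own assumptions: the five callables passed in (`get_diseases`, `get_states`, `get_cities`, `geocode`, `get_events`) are hypothetical stand-ins for the real web service and geocoding calls, and the stub values at the bottom exist just to show the shape of the output.

```python
def collect_events(get_diseases, get_states, get_cities, geocode, get_events):
    """Walk the five steps; every callable is a hypothetical stand-in."""
    diseases = get_diseases()                       # step 1
    states = get_states()                           # step 2

    locations = {}
    for state in states:
        for city in get_cities(state):              # step 3
            # step 4: look up lat/lon once per state/city pair
            locations[(state, city)] = geocode(state, city)

    records = []
    for disease in diseases:                        # step 5
        for (state, city), (lat, lon) in locations.items():
            for event, year, week, count in get_events(disease, state, city):
                records.append(
                    (event, state, city, disease, year, week, count, lat, lon)
                )
    return records


# Stubbed usage so the sketch runs without touching the network.
records = collect_events(
    get_diseases=lambda: ["MEASLES"],
    get_states=lambda: ["AK"],
    get_cities=lambda state: ["KETCHIKAN"],
    geocode=lambda state, city: (55.3422, -131.6461),
    get_events=lambda disease, state, city: [("Case", 1914, 24, 1)],
)
print(records)
```

Geocoding each state/city once, before the disease loop, keeps the slow lookups from being repeated for every disease.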
Some Python code:
The code below gives examples of how to pull the data from the Tycho site. The API key is assigned by them. I found some cases where there are ‘[’ and ‘]’ characters in the data. Since I couldn’t determine what to do with these, I simply skip them. I also replace spaces with underscores and strip commas, since both make parsing difficult.
    import urllib.request
    import xml.etree.ElementTree as et  # assuming 'et' is ElementTree

    def get_disease(key):
        listOfdisease = []
        url = 'http://www.tycho.pitt.edu/api/diseases?apikey=' + key
        response = urllib.request.urlopen(url)
        html = response.read()
        xml = et.fromstring(html.decode('utf8'))
        # the data/ directory must already exist
        with open("data/disease.data", "w") as myfile:
            for element in xml.findall("row"):
                name = element.find("disease").text
                # skip names containing characters we don't want: '[' ']' '/'
                if "[" not in name and "]" not in name and "/" not in name:
                    name = name.replace(" ", "_").replace(",", "")
                    print(name)
                    listOfdisease.append(name)
                    myfile.write(name + "\n")
        return listOfdisease
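The name filtering can be pulled out into a small function so it can be checked without calling the web service. This is a sketch of the same rules as I understand them from the code above; `clean_name` is a hypothetical helper name and the inputs are made-up strings, not values taken from the Tycho data.

```python
def clean_name(name):
    """Return a file-friendly name, or None if the name should be skipped."""
    # Names containing '[', ']' or '/' are skipped outright.
    if any(ch in name for ch in "[]/"):
        return None
    # Spaces become underscores and commas are dropped.
    return name.replace(" ", "_").replace(",", "")


print(clean_name("SCARLET FEVER"))
print(clean_name("SOMETHING [UNCLEAR]"))
```

Returning `None` for skipped names lets the caller decide whether to log them or silently move on.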
Find the states in the response. Each row carries the state abbreviation in the ‘state’ tag and the state name in the ‘loc’ tag.
    xml = et.fromstring(html.decode('utf8'))
    for element in xml.findall("row"):
        StateAbv = element.find("state")
        State = element.find("loc")
        listOfStates.append(StateAbv.text)
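To try the state parsing without the service, here is a small self-contained sketch. The XML sample is my own guess at the response layout: the snippet above only tells us there are `row` elements with `state` and `loc` children, so the `result` root element and the sample values are assumptions, as is the `parse_states` name.

```python
import xml.etree.ElementTree as et

# Assumed layout: a root element holding <row> children.
SAMPLE_STATES = """<result>
  <row><state>AK</state><loc>ALASKA</loc></row>
  <row><state>AL</state><loc>ALABAMA</loc></row>
</result>"""


def parse_states(xml_text):
    """Map state abbreviation -> state name from a states response."""
    root = et.fromstring(xml_text)
    states = {}
    for row in root.findall("row"):
        abbrev = row.findtext("state")
        name = row.findtext("loc")
        if abbrev:  # ignore rows with a missing or empty abbreviation
            states[abbrev] = name
    return states


states = parse_states(SAMPLE_STATES)
print(states)
```

Keeping the name alongside the abbreviation costs nothing and is handy later for geocoding.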
Finding cases or deaths is a bit more complicated. The field ‘number’ represents the number of events for that period.
    for element in xml.findall("row"):
        year = element.find("year")
        week = element.find("week")
        number = element.find("number")
        if int(number.text) > 0:
            case = Case(disease, year.text, week.text, number.text, state)
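The `Case` class is not shown above, so here is one possible shape for it together with a parser that keeps only rows where ‘number’ is greater than zero. Treat all of it as a sketch: the dataclass fields, the `parse_cases` name, the `result` root element and the sample rows are my assumptions, and unlike the snippet above this version converts year, week and number to integers.

```python
import xml.etree.ElementTree as et
from dataclasses import dataclass


@dataclass
class Case:
    """Hypothetical stand-in for the Case class used in the snippet above."""
    disease: str
    year: int
    week: int
    number: int
    state: str


# Assumed layout: a root element holding <row> children.
SAMPLE_EVENTS = """<result>
  <row><year>1914</year><week>24</week><number>1</number></row>
  <row><year>1914</year><week>25</week><number>0</number></row>
</result>"""


def parse_cases(xml_text, disease, state):
    """Return a Case for every row whose 'number' is greater than zero."""
    root = et.fromstring(xml_text)
    cases = []
    for row in root.findall("row"):
        # A missing or empty <number> is treated as zero events.
        number = int(row.findtext("number") or 0)
        if number > 0:
            cases.append(Case(disease,
                              int(row.findtext("year")),
                              int(row.findtext("week")),
                              number,
                              state))
    return cases


cases = parse_cases(SAMPLE_EVENTS, "MEASLES", "AK")
print(cases)
```

Dropping the zero-count weeks keeps the combined file much smaller, since most weeks in most cities report nothing for a given disease.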
(Moving to GitHub soon).
The hardware. For the most part I just use a Windows laptop. I need to run Hadoop and Spark, and since I use the laptop for work I need a different solution, something that I can run without disruption. Hadoop is marketed as running on commodity hardware. Let’s see. I have two old systems that I have installed Linux (Ubuntu) on. I also installed Hadoop and a host of other support stuff. I need to get these set up as a cluster at some point.