A year with data
I have been trying to work on data analysis for a while, but it’s been a lot of start and stop. I started with pure spatial data (University of Maine, Spatial Information) and then started working with public health data. Eventually I came to understand that the two are connected. Considering where events occurred can be helpful in understanding how to handle public health issues. Some guy named John Snow figured this out in the 1850s, mapping cholera cases in London. The big data movement has made things like machine learning, R, NLP, Hadoop, Pandas and Spark popular. I have decided to spend the next year mucking about with a couple of data sets to get a better idea of what can and can’t be done.
There are two data sets I plan to use (so far).
The first is public health information assembled by Project Tycho at the University of Pittsburgh. The project has gathered public health data covering more than a hundred years. It consists of events (cases or deaths) due to disease. Each event is associated with a State, City, Year, and Day of the year. I have added Lat and Lon for each City.
The second set is being created each day (when I remember…). It involves pulling data from a job board using their API. This data is nice because it is changing every day. It also has a lot of free text that might be useful for NLP or classification.
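To make the shape of one event concrete, here is a minimal Python sketch of a record. The class name, field names and sample values are my own assumptions for illustration, not the actual Project Tycho column names. It also shows one way to turn Year plus Day of the year into a calendar date, which is handy for time-based grouping later.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class DiseaseEvent:
    """One event as described above. All field names are assumed."""
    disease: str       # name of the disease (assumed field)
    kind: str          # "cases" or "deaths"
    state: str
    city: str
    year: int
    day_of_year: int   # 1 = January 1
    lat: float         # latitude added for the city
    lon: float         # longitude added for the city

    def event_date(self) -> date:
        # Convert year + day of the year into a calendar date.
        return date(self.year, 1, 1) + timedelta(days=self.day_of_year - 1)


# Made-up sample row, only to show the shape of a record.
sample = DiseaseEvent("MEASLES", "cases", "ME", "PORTLAND", 1920, 60, 43.66, -70.26)
print(sample.event_date())
```

Keeping the day of the year as stored and deriving the date on demand means the record stays close to the source data, and leap years are handled by the standard library rather than by hand.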
My day job involves Java. For this effort I’d like to stay with Python. There are some exceptions where Java might make more sense, such as loading large data sets or Hadoop MapReduce. I am using Django to create web apps as needed.
Python has its own analysis libraries, and it also works well with R. It is probably a good path to stay with.
My favorite data store is Neo4j, the graph database.