Like a lot of people I grew up with video games, but they were quite different from what we have today. Space Invaders, Lunar Lander, Missile Command and Asteroids look like cave drawings compared to what is available now. I have experimented with tools like LightWave and Maya, but their costs are prohibitive and they are not really suited for amateur game developers. Unity 3D, on the other hand, is ideally suited for those just getting started with game development, and it can also support more complex professional games. Unity’s recent announcement of free support for mobile applications means it’s time for me to make the leap.

A modern game typically requires a lot of people, mainly artists, to create scenes and characters. I can use tools such as Blender, but I am not nearly proficient enough to build the art as well as create the game. I need a game where I can leverage existing artwork and just focus on the mechanics of the game and on learning Unity.

What I need is a 2D side-scrolling space game. I decided to try to replicate the Lunar Lander game.

old-lunar-lander

It won’t be an exact match but instead something more updated that fits with the Unity model. Look around in the Apple and Google app stores and you can find a number of these games. Some are 2D while others are 3D and much more realistic. I am not trying to be the next “Flappy Bird”, so I don’t expect to compete with other games. It’s all about the learning.

Unity 3D

A lot can be done with Unity right out of the box. Anything that requires reacting to a user (player) is going to bring up the need for custom code. There are two choices for doing this, C# and JavaScript. A lot of the tutorials and examples are in JavaScript, so I’ll stick with that.

The Game

The point of the game is to land the ship on the surface before you run out of fuel and crash. In the earlier games the ship would rotate as well as translate. Having to correct the rotation makes the game much more difficult to play. For this version I’ll stick with simple translation left, right, up and down. Of course there needs to be a surface to land on. A simple flat surface is boring. Adding some sort of obstacles will make it a bit more challenging.

Things to consider:

  • The ship
  • Obstacles
  • Landing
  • Movement
  • Gravity
  • Fuel
  • Crashing
  • Player controls
  • Scoring
  • Sound

The Ship

Unity can import models from many tools such as Blender and 3ds Max. For a mobile game the model cannot be too complex: the more detailed the model, the poorer the game performance will be. I found a reasonably sized lunar lander model from NASA that is free to use.

nasa-model

Obstacles

In the original game the surface changed from flat to mountains. I decided to add rocks to a flat surface. In order to make things a bit more complex I added the rocks at random locations and sizes.

rocks rocks2

Landing

The rocks provide obstacles to avoid, but there needs to be a ‘safe’ landing place. These are marked green so the player can see them. Since the rocks are randomly placed, the landing places need to be adjusted as well. The process is to place the landing spots first and then place the rocks. The code has to make sure the rocks do not cover a landing place and that there is enough room for the lander.

Startup code to build the scene:

Declare the rocks and landing pads

var rocks: Transform[];
var landingPads: Transform[];

Find the game object tagged GUI so that we can determine the player’s level. The landing pads are adjusted differently once the player is beyond level one.

Create the landing pads by varying the “x” value.

GUI = GameObject.FindGameObjectWithTag("GUI").GetComponent(InGameGUI);
 if(GUI.playerLevel > 1)
 startx = (GUI.playerLevel * 1.1)* 4895.0;
 else
 startx = 4895.0;
 currentXoffset =startx + 1200*Random.Range(3,10);
 for(i =1; i < numberOfLandingPads; i++) {
 lp = Instantiate(landingPads[0], Vector3 (currentXoffset,-69.0, 514.6719), Quaternion.identity);
 lp.transform.localScale.x = 160;
 lp.transform.localScale.y = 1.1;
 lp.transform.localScale.z = 160;
 lp_locations[lp_locations_index,0] = currentXoffset;
 lp_locations[lp_locations_index,1] = (lp.transform.localScale.x*5);
 
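 // The component type inside GetComponent.<...>() was lost from this listing; a Renderer is assumed here.
 currentXoffset += (lp.GetComponent.<Renderer>().bounds.size.x*Random.Range(3,6));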
 lp_locations_index++; 
 }

Create 1,000 rocks. Each rock is generated at a random x location, and the height of each rock (the y direction) is also random. The game is 2D but I am using Unity in 3D mode, so the rocks are created in a 3D field. At some point I may change the game to be more 3D. Each rock is checked to make sure that it doesn’t overlap a landing pad. I didn’t want the code to get stuck in the overlap check, so after 10 tries I give up.

for (var x = 0; x < 1000; x++) {
    var breakOut = 0;
    do {
        var index = Random.Range(0,4);
        var locX = Random.Range(-50000,50000);
        var locZ = Random.Range(-3000,2000);
        var scaleX = 200; //Random.Range(Random.Range(5,50),Random.Range(150,200));
        var scaleY = Random.Range(Random.Range(5,50),Random.Range(70,500));
        if (GUI.playerLevel > 2)
        {
            scaleY = Random.Range(Random.Range(5,50),Random.Range(70,GUI.playerLevel*500));
        }
        var scaleZ = 400; //Random.Range(Random.Range(5,50),Random.Range(50,100));
        // Debug.Log( " Creating rocks locX "+locX + " locZ " +locZ +" scaleX " +scaleX+ " scaleY " +scaleY);
        breakOut++;
        if (breakOut > 10)
        {
            // Debug.Log("==============breakOut++++++++++++");
            return;
        }
    } while (checkOverlap(locX));

    rock = Instantiate(rocks[index], Vector3 (locX, 0, locZ), Quaternion.identity);
    rock.transform.localScale.x = scaleX;
    rock.transform.localScale.y = scaleY;
    rock.transform.localScale.z = scaleZ;
    rock.tag = "rock";
}
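The placement idea can also be tried outside Unity. The following is a minimal Python sketch, not the game code: the pad representation (centre x, half width), the helper names and the sample pad positions are all hypothetical. It also differs from the listing above in one respect: the Unity code returns from the whole function after ten failed tries, while this sketch only skips that one rock and carries on.

```python
import random


def overlaps_pad(x, pads):
    """Return True if x falls inside any (center, half_width) landing pad span."""
    return any(abs(x - center) <= half_width for center, half_width in pads)


def place_rocks(pads, count=1000, max_tries=10, x_range=(-50000, 50000), rng=None):
    """Pick up to `count` rock x positions, rejecting any that land on a pad."""
    rng = rng or random.Random()
    rocks = []
    for _ in range(count):
        for _ in range(max_tries):
            x = rng.uniform(*x_range)
            if not overlaps_pad(x, pads):
                rocks.append(x)
                break
        # If every try overlapped a pad, this rock is skipped.
    return rocks


# Hypothetical landing pads: (center x, half width)
pads = [(6000.0, 800.0), (9000.0, 800.0)]
rocks = place_rocks(pads, rng=random.Random(42))
print(len(rocks), "rocks placed")
```

Passing a seeded `random.Random` makes a layout repeatable, which helps when checking that a level can actually be completed.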

A lot of values are hardcoded simply for expedience. Good software practice would be to use variables or constants.

Movement

Since the game has more than one or two controls, it requires buttons. Keyboard controls are not an option on a mobile device and multi-touch is complicated. I need to control the main engine (up), the left and right thrusters, and a pause button.

A ParticleEmitter is used to indicate engine or thruster action.

var engineThruster : ParticleEmitter;
var LeftThruster : ParticleEmitter;
var RightThruster : ParticleEmitter;

An audio file is played when the engine is on. While the engine button is pressed the emitter is set to true.

// if the Emitter is not running then fire it
// and play the sound
// then move ship up
 if(engineThruster.emit == false)
 {
 GetComponent.<AudioSource>().PlayOneShot(engineSound);
 engineThruster.emit = true; 
 }
 moveShip_up();

function moveShip_up(){
 
 var dir:Vector3;
 // if we are out of fuel then do not move the ship
 if(fuelMeterCurentValue == 0)
    return;
 
 // update the fuel status
 updateFuel();
 
// get the local pos
 pos = Camera.main.WorldToScreenPoint(transform.position);
 
 // if the ship is higher than the screen 
 // set the velocity to 0
 if( pos.y >= Screen.height)
 {
     transform.GetComponent.<Rigidbody>().velocity.y=0.0;
 }
 else
 {
 // yMovement is either 1 or 0 depending on the button pushed
 // it limits movement to X or Y movement only
 // adjust the upward velocity the further away from the ground.
 // the value '200' should be replaced with ratio of the screen 
 // height
 if( (pos.y < ceiling) && (pos.y > Screen.height-200))
 {
     dir = Vector3(0,yMovement*upwardThrust/2.0,0);
 }
 else 
 {
    if( (pos.y < Screen.height-200) && (pos.y > Screen.height/2))
    {
      dir = Vector3(0,yMovement*upwardThrust/1.5,0);
    }
    else
   {
     dir = Vector3(0,yMovement*upwardThrust,0);
   } 
 } 
   // add force to the ship
   GetComponent.<Rigidbody>().AddForce(dir);

  }
}

Gravity

The assumption is that the planet has gravity. I have left the gravity setting standard as Unity sets it.

Fuel

Fuel usage is adjusted whenever the engine is running. In Unity’s FixedUpdate() function the fuel is adjusted:

 fuelMeterCurentValue -=fuelLossRate*Time.deltaTime;

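Multiplying by Time.deltaTime scales the fuel loss by the time elapsed since the previous step (inside FixedUpdate() this is the fixed timestep), so fuelLossRate is expressed in fuel units per second rather than per call. It is standard practice in Unity to do this for anything that changes over time in an update call.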
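To see why the multiplication by the timestep matters, here is a small Python sketch of the same arithmetic. It is not Unity code; the function name, the loss rate and the timestep values are made up for illustration. Draining for four seconds gives the same result whether the loop runs 50 or 100 times per second.

```python
def drain_fuel(fuel, loss_rate, dt, seconds):
    """Drain fuel in fixed steps of dt seconds; loss_rate is in units per second."""
    steps = round(seconds / dt)
    for _ in range(steps):
        fuel = max(0.0, fuel - loss_rate * dt)
    return fuel


# Same engine burn, two different update rates.
slow = drain_fuel(100.0, loss_rate=5.0, dt=0.02, seconds=4.0)  # 50 steps per second
fast = drain_fuel(100.0, loss_rate=5.0, dt=0.01, seconds=4.0)  # 100 steps per second
print(slow, fast)
```

Without the `dt` factor the faster loop would burn fuel twice as quickly.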

Crashing

There are two ways to fail a landing. One is to land on rocks. The other is to land too fast. A vertical velocity indicator turns red when the ship is landing too fast. When the ship touches the landing pad the velocity is checked. The function OnCollisionEnter() is called when two objects touch. In this case it will be the ship and either a landing pad or a rock. Setting Time.timeScale to zero stops the game play. GUI.guiMode is set to either win or lose. This causes the correct screen to be displayed and the score to be adjusted.

 if( theCollision.gameObject.tag == "landingpad" )
 {
   if( theCollision.relativeVelocity.magnitude > 50.0 )
   {
     Explode();
     GUI.guiMode = "Lose";
     Time.timeScale = 0;
   }
   else
   {
     Time.timeScale = 0;
     GUI.guiMode = "Win";
   }
 }
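The win/lose rule is small enough to write down as a pure function, which makes it easy to test away from the physics engine. This Python sketch is only an illustration: the function name is hypothetical, the speed limit of 50 and the tags "landingpad" and "rock" are taken from the listings above, and treating every non-pad collision as a loss is my reading of the description rather than code from the game.

```python
MAX_SAFE_SPEED = 50.0  # same threshold as the collision handler above


def landing_result(tag, relative_speed):
    """Return "Win" for a slow touchdown on a landing pad, otherwise "Lose"."""
    if tag == "landingpad" and relative_speed <= MAX_SAFE_SPEED:
        return "Win"
    return "Lose"


print(landing_result("landingpad", 12.0))
print(landing_result("landingpad", 75.0))
print(landing_result("rock", 5.0))
```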

Player Controls

Since this is a mobile game there need to be buttons for the player. A single touch would work if all it had to do was run the lander engine. Left and right translations are harder. A touch to the left of the lander could move it left, and the same for the right. But since the lander moves, it could move under the touch point and cause the direction to change. Buttons just seem easier.

Unfortunately Unity’s UI is not straightforward. The placement and operation of a button is pretty simple. Buttons are GUITexture components. Getting the position and sizing correct for different size devices is a challenge. There is talk that future versions of Unity will have better UI tools.

In the FixedUpdate function I test each button.

for (var touch : Touch in Input.touches)
 {
    if (engineButton.HitTest (touch.position))
    {
      // handle engine event
    }
    if (leftThrusterButton.HitTest (touch.position))
    {
      // handle left thruster event
    }
    // ...
}

Scoring

Scoring is pretty straightforward. Land successfully and you get a point and proceed to the next level. Crash and you have to repeat the level. At each level the landing spots get harder to find. As the level increases I need to increase the fuel (or lower the rate at which it is used).

Sound

Sound is handled from an AudioSource component.

 GetComponent.<AudioSource>().PlayOneShot(engineSound);

This plays the sound once. As long as the button is held down the sound will be played over and over. Playing the sound in a loop is possible for something like background music. For sounds like the engine or thrusters I need short bursts of sound.

Screen Shots

The ship approaching a landing pad. The vertical velocity is in white and positive. This indicates that the ship is moving up at a rate within the range for landing.

landing-pad

Since the landing pads are randomly placed I found it hard to locate them and not run out of fuel. I added an overhead view in the upper right corner to guide the player towards a landing pad.

over-head

The left corner shows the fuel and velocity levels.

fuel-speed-menu

The ship over the rocks. The vertical velocity is in red and negative. This indicates that the ship is moving down at a rate too high to land.

paused

Google Play

I decided to put the game on Google Play just to see how this process works.

Update: I see one person has complained that at a high level you just crash into the rocks. It could be that this is a fuel issue. The landing pads are too far away for the fuel usage rate.

https://play.google.com/store/apps/details?id=com.punkinsoft&hl=en

Once I get the iOS version to work I’ll put it on the Apple App Store as well.

Posted in Uncategorized

back to the jobs data..

The task is to query the system each day and store the results. The goal is to have sufficient data to run through a Hadoop/Spark process and perform text analysis. Many of the postings, especially from agencies, are duplicates. I want to see how well one could match job postings using Spark machine learning clustering.

Since the API will also return lat and lon, there is the opportunity to do some spatial analysis.

The API for the job site lets you filter by keyword, state, city, date and employer or employer/agency.
You can also limit the data returned each time. Prefixing a keyword with ‘-’ will ignore listings that include that word.
At this stage I want to ignore jobs from agencies and contract jobs. This is because many agencies post the same job and many ignore the location, i.e. they post jobs in MA that are located in CA.

For the second part of this experiment I will change this to pick up all jobs and try to use classification to identify similar jobs.

I define several lists:
1. The states to check
2. The language keywords to use
3. Skill set keywords

states = ['ME', 'MA', 'TN', 'WA', 'NC', 'AZ', 'CA']
languageset = ['java', 'ruby', 'php', 'c', 'c++', 'clojure', 'javascript', 'c#', '.net']
skillset = ['architect', 'team lead']

The API expects a parameter dictionary to be passed in. Besides using the “-” to ignore keywords, I am setting “as_not” to two agencies that I know to ignore. “sr” and “st” are set to try to avoid contract jobs and agencies.

The default dictionary is:

params = {
    'as_not': "ATSG+Cybercoders",
    'sr': "directhire",
    'st': "employer",
    'l': "ma",
    'filter': "1",
    'format': "json",
    'fromage': "last",
    'start': "0",
    'limit': "100000",
    'latlong': "1",
    'psf': "dvsrch",
    'userip': "xxx.xxx.xxx.xxx",
    'useragent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2)"
}
Each day I run it (yes, I should set it up to run with cron).
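One way the per-language, per-state parameter dictionaries could be built is sketched below. This is an illustration and not the script I run: the function name is hypothetical, only a few of the default parameters are repeated, the language list is shortened, and the keyword parameter name 'q' is an assumption since it does not appear in the default dictionary. The ‘-’ prefix for excluded words and the 'l' location parameter come from the description above.

```python
from urllib.parse import urlencode

STATES = ["ME", "MA", "TN", "WA", "NC", "AZ", "CA"]
LANGUAGES = ["java", "ruby", "php"]  # shortened list for the example
BASE_PARAMS = {"format": "json", "limit": "100000", "latlong": "1"}


def build_queries(languages, states, base_params, excluded_words=()):
    """Return one parameter dict per (language, state) combination."""
    queries = []
    for language in languages:
        for state in states:
            params = dict(base_params)  # copy so the defaults stay untouched
            # 'q' is an assumed name for the keyword parameter.
            params["q"] = " ".join([language] + ["-" + w for w in excluded_words])
            params["l"] = state.lower()
            queries.append(params)
    return queries


queries = build_queries(LANGUAGES, STATES, BASE_PARAMS, excluded_words=["contract"])
first_query_string = urlencode(queries[0])
print(len(queries), first_query_string)
```

Copying the defaults for every query keeps one run from leaking state or keyword settings into the next.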

I convert the data into a Result object.

class Result:
name,company,location,jobtitle,city,state,country,source,date,latitude,longitude,jobkey, expired,sponsored ,snippet,url,meta

It is probably overkill and I could simply skip this step and go right to the database. But I really like the compactness of the object, and it makes the code look cleaner.
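A compact way to express such an object in Python is a dataclass. The sketch below uses the field names listed above; the field types, the defaults and the `from_api` helper are assumptions for illustration and not the class from my script.

```python
from dataclasses import dataclass, fields


@dataclass
class Result:
    # Field names follow the list above; types and defaults are assumed.
    name: str = ""
    company: str = ""
    location: str = ""
    jobtitle: str = ""
    city: str = ""
    state: str = ""
    country: str = ""
    source: str = ""
    date: str = ""
    latitude: float = 0.0
    longitude: float = 0.0
    jobkey: str = ""
    expired: bool = False
    sponsored: bool = False
    snippet: str = ""
    url: str = ""
    meta: str = ""

    @classmethod
    def from_api(cls, record):
        """Build a Result from one API record, ignoring unknown keys."""
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in record.items() if k in known})


# Made-up record for illustration.
job = Result.from_api({"jobkey": "abc123", "jobtitle": "Java Developer",
                       "state": "NH", "unexpected": "ignored"})
print(job.jobkey, job.jobtitle, job.state)
```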

for each language in languageset
    for each state
        query()
        convert the data to a Result object
        get the url to the main posting
        get the page
        use BeautifulSoup to parse the html
        get the content section of the page
        store the result in the Neo4j database

add to Neo4j
    use graph.merge_one() to create state, city and language nodes

    create new Job node(jobkey)
        job key is from the API and has a unique constraint to avoid adding the same one again.
    set Job properties(lat, lon, url, snippet, content, posted date, poll data)
    create relationships
        Relationship(job, "IN_STATE", state)
        Relationship(job, "IN_CITY", city)
        Relationship(job, "LANGUAGE", lang)

That is the code. After a few false starts I have been able to get it to run and gather about 16k listings.

Results

Below are some of the query results. I used Cypher’s RegEx to match on part of the job title.

match (j:Job)--(state:State{name:'NH'}) where j.name =~ 'Java Developer.*' return j,state
match (j:Job)--(state:State{name:'NH'}) where j.name =~ 'Chief.*' return j,state

  • Java Developer in New Hampshire
  • Java Developer in Maine and New Hampshire
  • Chief Technology in New Hampshire
  • PHP Developer in all the states polled

java dev nh me

Java Developer in New Hampshire and Maine

java dev nh

Java Developer in New Hampshire

ct nh

Chief Technology New Hampshire

php dev

PHP Developer all States

One of the goals is to run Spark’s machine learning library against the data. As a first test I will count the words in the job titles. To be able to check that the Spark process is working, I first counted the words in the job titles for New Hampshire without Spark. Now I have something to compare against after the Spark processing.
Below is a sample of the word count for all jobs polled in New Hampshire.

word count
analyst 14
application 10
applications 7
architect 15
architecture 4
associate 3
automation 5
business 2
chief 2
cloud 4
commercial 3
communications 4
computer 2
consultant 4
database 4
designer 3
developer 59
development 15
devices 3
devops 3
diem 3
electronic 2
embedded 5
engineer 83
engineering 2
engineers 2
integration 4
java 31
junior 3
linux 3
management 4
manager 8
mobile 3

I have Hadoop and Spark running. I need to get mazerunner installed and run a few tests. Then the fun begins…

Posted in Uncategorized

Machine learning Scrum Sprint estimates

Another idea as part of my “A year with data” exploration.

Anyone who has worked in a Scrum/Agile environment understands the pain involved with task estimation. Some of the methods include shirt sizes (S, M, L), the Fibonacci sequence (1, 2, 3, 5, 8, ...), powers of 2 (1, 2, 4, 8) and even poker. Then there is the process of dealing with disparate estimates. One person gives an estimate of 2 and another suggests it’s a 13. After some discussion it’s agreed that the task is an 8. At the end of the sprint maybe it turns out that the task really was a 5. It would be useful, and interesting, to determine how well people do in their estimation. Is person A always underestimating? Is person B mostly spot on?

This seems like a good candidate for Machine Learning, supervised learning to be more specific. I am not sure how many teams capture  information from the estimation process but they should.

Basic information such as:

  • The original estimates for each team member
  • The final agreed upon estimates
  • The actual size of the task once completed

The data might look like this:

Task TM1 TM2 TM3 TM4 TM5 TM6 TM7 TM8 TM9 Actual
1 1 8 1 13 5 8 2 5 13 8
2 3 8 5 8 8 5 3 1 8 5
3 2 5 5 5 5 2 1 8 1 3
4 8 5 6 3 1 2 2 13 5 5
5 3 5 5 8 8 8 8 13 13 13
6 1 3 5 1 1 1 1 2 5 2
7 1 3 5 1 1 5 8 5 3 2
8 5 3 5 3 2 1 1 3 2 1
9 8 8 6 5 8 8 13 3 5 5
10 2 5 5 8 8 8 8 8 8 13

The ‘training’ data consists of ten tasks, the estimates from each of the nine team members and the actual value the task turned out to be. I chose the Fibonacci sequence as the method for estimates. Another piece of information that could be useful is the estimate the team agreed upon. That could be compared to the actual value as well. I decided not to do this since it hides the interesting information in each team member’s estimate. By using each team member’s input we could determine which ones are contributing more or which ones are further off in their estimates.
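Before any machine learning, a simple per-person check already answers the "is person A always underestimating?" question. The sketch below uses the ten rows from the table above; the function name and the choice of mean signed error as the bias measure are mine and only one of several reasonable options.

```python
# Each row: estimates from TM1..TM9 followed by the actual value (table above).
ROWS = [
    (1, 8, 1, 13, 5, 8, 2, 5, 13, 8),
    (3, 8, 5, 8, 8, 5, 3, 1, 8, 5),
    (2, 5, 5, 5, 5, 2, 1, 8, 1, 3),
    (8, 5, 6, 3, 1, 2, 2, 13, 5, 5),
    (3, 5, 5, 8, 8, 8, 8, 13, 13, 13),
    (1, 3, 5, 1, 1, 1, 1, 2, 5, 2),
    (1, 3, 5, 1, 1, 5, 8, 5, 3, 2),
    (5, 3, 5, 3, 2, 1, 1, 3, 2, 1),
    (8, 8, 6, 5, 8, 8, 13, 3, 5, 5),
    (2, 5, 5, 8, 8, 8, 8, 8, 8, 13),
]


def estimation_bias(rows):
    """Mean of (estimate - actual) per team member; negative means underestimating."""
    members = len(rows[0]) - 1
    bias = {}
    for j in range(members):
        errors = [row[j] - row[-1] for row in rows]
        bias["TM%d" % (j + 1)] = sum(errors) / len(errors)
    return bias


bias = estimation_bias(ROWS)
for member, value in bias.items():
    print(member, round(value, 2))
```

On this small sample TM1 comes out at -2.3, meaning the estimates were on average 2.3 points below the actual size.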

Gradient descent

I am not going to try to explain gradient descent as there are others much better qualified to do that. I found the Stanford Machine Learning course to be the most useful. The downside is that the course used Octave and I want to use Python. There is a bit of a learning curve in making the change. Hopefully I have this figured out.

The significant equations are below.

The cost function J(θ) measures how far the predictions made with theta are from the actual outcomes; the lower the cost, the better theta predicts.

h theta

Where x_j^(i) is team member j’s estimate for task i.
x^(i) is the estimate (feature) vector for task i in the training set.
θ^T is the transpose of the theta vector.

h_θ(x^(i)) is the predicted value for task i.

The math looks like this.

j theta
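For reference, the standard linear regression forms of these quantities, written out so that they agree with the Python code further down (prediction as a dot product, squared error divided by 2m, and an update scaled by alpha/m), are given here. This is my transcription of the textbook formulas, with m the number of tasks, n the number of team members and y^(i) the actual value of task i, and is not a copy of the original images.

```latex
h_\theta\left(x^{(i)}\right) = \theta^{T} x^{(i)} = \sum_{j=0}^{n} \theta_j \, x_j^{(i)}

J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right)^{2}

\theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)}
```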

 

 

For this I am using the following Python packages:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Note: In order to use seaborn’s residplot I had to install ‘patsy’ and ‘statsmodels’.

easy_install patsy
pip install statsmodels

Set up pandas to display correctly:
pd.set_option('display.notebook_repr_html', False)

The first step is to read the data.

training_set = pd.read_csv('estimate data.txt')

Next, we need to separate the estimates from the actual values

 tm_estimates = training_set[['tm1','tm2','tm3','tm4','tm5','tm6','tm7','tm8','tm9']] 
 print(tm_estimates)   
        tm1  tm2  tm3  tm4  tm5  tm6  tm7  tm8  tm9
         1    8    1   13    5    8    2    5   13
         3    8    5    8    8    5    3    1    8
         2    5    5    5    5    2    1    8    1
         8    5    6    3    1    2    2   13    5
         3    5    5    8    8    8    8   13   13
         1    3    5    1    1    1    1    2    5
         1    3    5    1    1    5    8    5    3
         5    3    5    3    2    1    1    3    2
         8    8    6    5    8    8   13    3    5
         2    5    5    8    8    8    8    8    8


 actuals = training_set['est']
  
 sns.distplot(actuals)
 plt.show()

A distribution plot of the actuals
actuals dis plot

 

One thing to consider is normalization of the data. This is important when data values vary greatly. In this case the data is not all that different, but it’s worth the effort to add this step.

normal

 mean = tm_estimates.mean()
 std = tm_estimates.std()

 tm_estimates_norm = (tm_estimates - mean) / std
 print(tm_estimates_norm)
tm1 tm2 tm3 tm4 tm5 tm6 tm7 tm8 tm9
-0.147264 1.312268 0.143019 0.661622 1.031586 0.064851 -0.403064 -1.184304 0.405606
-0.515425 -0.145808 0.143019 -0.132324 0.093781 -0.907909 -0.877258 -1.184304 0.405606
1.693538 -0.145808 0.858116 -0.661622 -1.156627 -0.907909 -0.640161 0.441211 -1.264536
-0.883585 -1.117858 0.143019 -1.190919 -1.156627 -1.232162 -0.877258 1.602294 1.598564
-0.883585 -1.117858 0.143019 -1.190919 -1.156627 0.064851 0.782419 -0.952088 -0.310169
0.589057 -1.117858 0.143019 -0.661622 -0.844025 -1.232162 -0.877258 -0.255438 -0.787353
1.693538 1.312268 0.858116 -0.132324 1.031586 1.037610 1.967903 -0.719871 -1.025944
-0.515425 -0.145808 0.143019 0.661622 1.031586 1.037610 0.782419 0.441211 0.405606

To satisfy the equation we need to add an extra column for theta0. For that we add x0 and set all of the values to 1

# the number of data points
m = len(tm_estimates_norm)
#add the x0 column and set all values to one.
tm_estimates_norm['x0'] = pd.Series(np.ones(m))

Next we define the learning rate alpha to be 0.15. The number of iterations is 150. Setting these two values will control how well the cost function converges.

    alpha = 0.15
    iterations = 150

    

Set the initial values of theta to zero. Then convert the data into numpy arrays instead of pandas structures.

  
    # Initialize theta values to zero
    thetas = np.zeros(len(tm_estimates_norm.columns))
    
 
    tm_estimates_norm = np.array(tm_estimates_norm)
    estimations = np.array(actuals)
    print(estimations)
    cost_history = []

Now do something!
First calculate the prediction, the dot product of the estimates and theta.
Next update theta using the gradient of J(θ).
Then calculate the cost and record it. This last step tells us whether the cost is decreasing or not.

    for i in range(iterations):
        # Calculate the predicted values
        predicted = np.dot(tm_estimates_norm, thetas)

        # Calculate the theta 
        thetas -= (alpha / m) * np.dot((predicted - estimations), tm_estimates_norm)
    
        # Calculate cost
        sum_of_square_errors = np.square(predicted - estimations).sum()
        cost = sum_of_square_errors / (2 * m)

        # Append cost to history
        cost_history.append(cost)

I tried different combinations of alpha and iterations just to see how this works.

The first attempt is using alpha = 0.75
high learning rate 75

This next try uses alpha = 0.05 and iterations = 50
lower learning rate 05

This last one uses alpha = 0.15 and iterations = 150
figure_2
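To make this kind of comparison repeatable, the loop can be wrapped in a function that returns the cost history for a given alpha and iteration count. The sketch below is a standard-library-only rewrite, so it runs without numpy or pandas. It uses the ten rows from the training table, normalises with the sample standard deviation (the pandas default) and adds the x0 column of ones. The function name is hypothetical, and since the iteration count for the alpha = 0.75 run is not stated above, 150 is assumed.

```python
from statistics import mean, stdev

# Each row: estimates from TM1..TM9 followed by the actual value.
ROWS = [
    (1, 8, 1, 13, 5, 8, 2, 5, 13, 8),
    (3, 8, 5, 8, 8, 5, 3, 1, 8, 5),
    (2, 5, 5, 5, 5, 2, 1, 8, 1, 3),
    (8, 5, 6, 3, 1, 2, 2, 13, 5, 5),
    (3, 5, 5, 8, 8, 8, 8, 13, 13, 13),
    (1, 3, 5, 1, 1, 1, 1, 2, 5, 2),
    (1, 3, 5, 1, 1, 5, 8, 5, 3, 2),
    (5, 3, 5, 3, 2, 1, 1, 3, 2, 1),
    (8, 8, 6, 5, 8, 8, 13, 3, 5, 5),
    (2, 5, 5, 8, 8, 8, 8, 8, 8, 13),
]


def gradient_descent(X, y, alpha, iterations):
    """Batch gradient descent for linear regression; returns (thetas, cost history)."""
    m = len(y)
    thetas = [0.0] * len(X[0])
    history = []
    for _ in range(iterations):
        predicted = [sum(t * x for t, x in zip(thetas, row)) for row in X]
        errors = [p - actual for p, actual in zip(predicted, y)]
        # Cost of the current thetas, recorded before they are updated.
        history.append(sum(e * e for e in errors) / (2 * m))
        thetas = [t - (alpha / m) * sum(e * row[j] for e, row in zip(errors, X))
                  for j, t in enumerate(thetas)]
    return thetas, history


features = [row[:-1] for row in ROWS]
y = [row[-1] for row in ROWS]
columns = list(zip(*features))
means = [mean(c) for c in columns]
stds = [stdev(c) for c in columns]
# Normalised features plus the x0 column of ones.
X = [[(v - mu) / sd for v, mu, sd in zip(row, means, stds)] + [1.0]
     for row in features]

runs = {}
for alpha, iterations in [(0.75, 150), (0.05, 50), (0.15, 150)]:
    _, runs[(alpha, iterations)] = gradient_descent(X, y, alpha, iterations)
    history = runs[(alpha, iterations)]
    print(alpha, iterations, history[0], history[-1])
```

Comparing the first and last cost of each run shows quickly whether a setting is converging; a last cost larger than the first means the learning rate is too high for this data.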


actuals predictions difference
8 7.923283 -0.076717
5 5.461475 0.461475
3 3.481465 0.481465
5 4.404572 -0.595428
13 14.287301 1.287301
2 1.225380 -0.774620
2 2.737848 0.737848
1 1.467125 0.467125
5 5.020789 0.020789
13 10.990762 -2.009238

This graph shows the linear fit between the predicted and actual values

lm

This graph shows the difference between the predicted and actual values

resids

The data set is far too small to declare anything. Where the actual value was high there is less data and the error is greater. In order to get more data I’ll have to make it up. Having worked in development for (many) years, I know that people tend to follow a pattern when giving estimates. The type of task will also dictate estimates. A UI task may seem simple to someone familiar with UI development, while a server/backend person may find a UI task daunting. In deciding how to create sample data I devised a scheme that gives each team member a strength in each of three skills: UI, database and server. Each member also has an estimation rating, which defines how they tend to rate tasks: low, high, mixed or random. Once I get this working I will start over and see how this new data performs.

 

Until then…

Posted in data

A year with data: Day 2

Wow! Two days in a row, good for me.

Where to start… Probably with the data.

Project Tycho.
The project has gathered data (level 2) over a 126-year period (1888 to 2014). Divided into cases and deaths, it includes fifty diseases, fifty states and 1284 cities. Access is via a web service. There are calls to get a list of all diseases, states, cities, cases and deaths. Using Python, I pulled the various pieces and stored each in a file. Since the process takes a while, it was better to get the data once and then format it as needed. For each state/city I also obtained the lat/lon information. Finally I gathered all of the data into one file where each record looked like the ones below:

Event St City Disease Year Week Count Lat Lon
Case, AK, KETCHIKAN,MEASLES, 1914, 24, 1, 55.34222219999999, -131.6461111
Death,AK, KETCHIKAN,MEASLES, 1914, 24, 1, 55.34222219999999, -131.6461111
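Reading such a record back is a matter of splitting on commas and converting the numeric fields. The sketch below is one way to do it; the `Event` class and `parse_event` function are hypothetical names for this example. A plain split is enough here only because commas were already stripped from the disease names when the file was written.

```python
from dataclasses import dataclass


@dataclass
class Event:
    kind: str      # "Case" or "Death"
    state: str
    city: str
    disease: str
    year: int
    week: int
    count: int
    lat: float
    lon: float


def parse_event(line):
    """Parse one comma-separated record of the combined data file."""
    parts = [p.strip() for p in line.split(",")]
    kind, state, city, disease = parts[:4]
    year, week, count = (int(p) for p in parts[4:7])
    lat, lon = (float(p) for p in parts[7:9])
    return Event(kind, state, city, disease, year, week, count, lat, lon)


event = parse_event(
    "Case, AK, KETCHIKAN,MEASLES, 1914, 24, 1, 55.34222219999999, -131.6461111")
print(event)
```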

The process:
1. Get all diseases.
2. Get all States.
3. For each State get all cities.
4. For each State/City geocode the city.
5. For each Disease:
   For each State and City get events.

Some python code:

The code below gives examples of how to pull the data from the Tycho site. The key is assigned by them. I found some cases where there are ‘[‘ and ‘]’ characters in the data. Since I couldn’t determine what to do with these I simply skip them. I also check for commas and spaces, which make parsing difficult.

import urllib.request
import xml.etree.ElementTree as et  # assumed to be what 'et' refers to

def get_disease(key):
    listOfdisease = []
    url = 'http://www.tycho.pitt.edu/api/diseases?apikey=' + key
    response = urllib.request.urlopen(url)
    html = response.read()

    xml = et.fromstring(html.decode('utf8'))
    with open("data/disease.data", "w") as myfile:
        for element in xml.findall("row"):
            name = element.find("disease").text
            # skip names containing characters we don't want: '[' ']' '/'
            if "[" in name or "]" in name or "/" in name:
                continue
            # spaces and commas make later parsing difficult
            name = name.replace(" ", "_").replace(",", "")
            print(name)
            listOfdisease.append(name)
            myfile.write(name + "\n")

    return listOfdisease

Find the states in the response. Each row carries the state abbreviation in the ‘state’ tag and the state itself in the ‘loc’ tag.

    xml = et.fromstring(html.decode('utf8'))
    for element in xml.findall("row"):
        StateAbv = element.find("state")
        State = element.find("loc")
        listOfStates.append(StateAbv.text)

Finding cases or deaths is a bit more complicated. The field ‘number’ represents the number of events for that period.

    for element in xml.findall("row"):
        year = element.find("year")
        week = element.find("week")
        number = element.find("number")
        if int(number.text) > 0:
            case = Case(disease, year.text, week.text, number.text, state)

(Moving to GitHub soon).

The hardware. For the most part I just use a Windows laptop. I need to run Hadoop and Spark, and since I use the laptop for work I need a different solution, something that I can run without disruption. Hadoop is marketed as running on commodity hardware. Let’s see. I have two old systems that I have installed Linux (Ubuntu) on. I also installed Hadoop and a host of other support software. I need to get these set up as a cluster at some point.

Posted in Uncategorized

A year of data..

A year with data

I have been trying to work on data analysis for a while, but it’s been a lot of start and stop. I started with pure spatial data (University of Maine, Spatial Information) and then started working with public health data. Eventually I came to understand that the two are connected. Considering where events occurred can be helpful in understanding how to handle public health issues. Some guy named Snow figured this out in the 1850s. The big data movement has made things like machine learning, R, NLP, Hadoop, Pandas and Spark popular. I have decided to spend the next year mucking about with a couple of data sets to get a better idea of what can and can’t be done.

The data.
There are two data sets I plan to use (so far).
The first is public health information assembled by Project Tycho at the University of Pittsburgh. The project has gathered public health data for over a hundred years. It consists of events (cases or deaths) due to disease. Each event is associated with a State, City, Year, and Day of the year. I have added Lat and Lon for each City.

The second set is being created each day (when I remember…). It involves pulling data from a job board using their API. This data is nice because it is changing every day. It also has a lot of free text that might be useful for NLP or classification.

Coding
My day job involves Java. For this effort I’d like to stay with Python. There are some exceptions where Java might make more sense, such as loading large data sets or Hadoop MapReduce. I am using Django to create web apps as needed.
Python has its own analysis libraries, and it also works well with R. Probably a good path to stay with.

Storage
My favorite data store is Neo4j, the graph database.

Posted in Uncategorized

Multi-channel Attribution Using Neo4j Graph Database

Neo4j Graph:Multi-channel Attribution

Business Applications

Introduction

Globally more than $500 Billion was spent on advertising (Lunden, 2013). One of the greatest challenges of spending money on advertising is trying to understand the impact of those dollars on sales. With the proliferation of multiple mediums or channels (TV, search engines, social media, gaming platforms and mobile) on which precious marketing dollars can be spent, a Chief Marketing Officer (CMO) is in dire need of insights into the return on his investment in each medium. More importantly, the CMO needs timely data to prove that spending on a specific channel has a good return on investment. Neo4j can be used to help marketing applications get answers to tough questions:

  • How much was the increase in web awareness of the product after a commercial was aired in a specific TV channel on a specific date in a specific geographic area?
  • How much of that web awareness translated into…


Posted in Uncategorized

Image processing to find tissue contours

The GRiTS project (https://rickerg.com/projects/project/) considered how genes interact with each other in space and time. The evaluation process begins by determining the border structure of the image. This is done manually by outlining the border with a drawing tool. The resulting border data points were captured and fed into a tool such as 3ds Max to create a multidimensional shell. This gave the image tool a reference for aligning gene data points in the volume.

I have been interested in processes that would make this more automated. The GRiTS viewer tools were developed in C++ and Qt. ImageJ was also used to pre-process the images to remove some of the extraneous information within the image.

OpenCV

OpenCV is a C++ library designed for various image processing and machine vision algorithms. Here is a sample image that I am using.

The first thing to note is that some of the images have color and some are gray scale. The colors are used in some cases to indicate a specific piece of information. In this case I am looking for contours, and for that a gray scale image works better.

Gray

The function cvtColor() converts the image to gray scale. In the version of OpenCV I am using, the call takes the enum CV_BGR2GRAY as the option. In later versions I believe this has changed to COLOR_BGR2GRAY. The function takes a src and dest image along with the appropriate code.

Blur

There are a number of blur options available. I have started out with the basic normalized box filter blur. I have set the kernel size as 3×3 to start with.

Threshold

This process removes unwanted values. But “unwanted” depends on the image and what you are trying to eliminate. In this case I am looking for pixels that make up the boundary. I don’t want to be harsh in removing values since this causes large gaps that are hard to fill. For this test I have selected the threshold value to be 50 and the max value to be 250. Of course these values will change depending on the image. I suspect that this will require applying some statistics and machine learning to create “best guess” starting values. After all, the goal is to make this as automatic as possible.

threshold(src, dest, threshold_value, max_BINARY_value, THRESH_TOZERO);
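To make the blur and threshold steps concrete, here is a small pure-Python sketch that applies them to an image stored as a list of rows. It is an illustration only and not OpenCV: the function names are hypothetical, and the border handling is simplified (OpenCV pads the image at the edges, while this version averages only the neighbours that exist). With THRESH_TOZERO, pixels above the threshold keep their value and everything else becomes zero, so the max value argument is not actually used in this mode.

```python
def box_blur_3x3(image):
    """Normalized 3x3 box filter over a list-of-rows gray image."""
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            window = [image[rr][cc]
                      for rr in range(max(0, r - 1), min(height, r + 2))
                      for cc in range(max(0, c - 1), min(width, c + 2))]
            out[r][c] = round(sum(window) / len(window))
    return out


def threshold_to_zero(image, thresh):
    """Keep pixels strictly above thresh, set the rest to 0 (THRESH_TOZERO behaviour)."""
    return [[v if v > thresh else 0 for v in row] for row in image]


# Made-up 3x3 image with a single bright pixel in the middle.
image = [[0, 0, 0],
         [0, 90, 0],
         [0, 0, 0]]
blurred = box_blur_3x3(image)
kept = threshold_to_zero([[10, 50, 51, 200]], 50)
print(blurred[1][1], kept)
```

The example shows why a harsh threshold opens gaps: after blurring, the single bright pixel is spread thin, and a threshold of 50 would remove it entirely.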

Edge

After blur and threshold I applied an edge detection process, using the Canny algorithm. lowThresh and highThresh define the two threshold levels for the hysteresis process. edgeThresh is the aperture size for the Sobel operator.

int edgeThresh = 3;
double lowThresh = 20;
double highThresh = 40;
Canny(src, dest, lowThresh , highThresh , edgeThresh );

openview contour

Contour

The edge detection process creates a lot of segments. Contouring will try to connect some of the segments into longer pieces.

CV_RETR_EXTERNAL: find only outer contours
CV_CHAIN_APPROX_SIMPLE: compresses segments
Point(0, 0): offset for contour shift.
Contours are stored in the contours variable, a vector<vector<Point> >; hierarchy is a vector<Vec4i>.
findContours( edgeDest, contours, hierarchy, CV_RETR_EXTERNAL, CV_CHAIN_APPROX_SIMPLE, Point(0, 0) );

openview

ImageJ

I took the same processes used with OpenCV and implemented them with ImageJ. Using ImageJ is different than using OpenCV. It is really designed so that the developer creates plugins that the ImageJ tool can use. I expected to use ImageJ as a library, as part of a bigger app.

imagej

This was used to create the above image. The raw image is on the left.

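ImagePlus imp = IJ.openImage("Dcc29.jpg");
// rawimp was not defined in the original listing; a second copy of the image is assumed for the raw view
ImagePlus rawimp = IJ.openImage("Dcc29.jpg");
ImageProcessor rawip = rawimp.getProcessor();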
rawip = rawip.resize(rawip.getWidth()/2,rawip.getHeight()/2);
ImageConverter improc = new ImageConverter(imp);
improc.convertToGray16();
ImageProcessor ip = imp.getProcessor();
ip.blurGaussian(1.6);
ip.threshold(185);
ip = ip.resize(ip.getWidth()/2,ip.getHeight()/2);
ip.erode();
ip.findEdges();

Scale over time

Another issue is scale. The complete set of images represents development over a period of time. In the beginning the images are small. By the end of the series they are considerably larger. Mapping positions on the cell images as they grow is still a challenge. Landmarks change over time, coming and going, so they can’t be counted on. The images are obtained at intervals which are relatively close to each other. This means that points on one image will be close to points on other images in similar positions at similar time periods.

Consider an image in the middle (image #20) at day 15. Points on this image should be close to points on a middle image at day 18.

By interpolating between images it may be possible to track point movement over a period of time.
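The simplest form of this is linear interpolation between two matched points. The sketch below assumes the match between the day 15 point and the day 18 point is already known, which is the hard part and is not solved here; the function name and the coordinates are made up for illustration.

```python
def interpolate_point(p_start, p_end, t_start, t_end, t):
    """Linearly interpolate a point's position at time t between two observations."""
    fraction = (t - t_start) / (t_end - t_start)
    return tuple(a + fraction * (b - a) for a, b in zip(p_start, p_end))


# Made-up positions of the same point on the day 15 and day 18 images.
day15 = (100.0, 40.0)
day18 = (130.0, 55.0)
day16 = interpolate_point(day15, day18, 15, 18, 16)
print(day16)
```

Because the tissue grows between images, a real version would probably need to rescale the coordinates before interpolating, but the time-weighted blend stays the same.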

Posted in Uncategorized