Building BB8

Soon after the latest Star Wars movie came out, Sphero introduced its model of the BB-8 robot.


Soon afterward people were taking it apart to see how it worked.


Two designs emerged: the hamster cage and the single axis.


A few people started posting DIY projects trying to build a “working” BB-8 robot.

I decided to try my hand at building a “working” BB-8 as well, starting in January with the goal of being ready for PortCon (Portland, ME) in June.

The Sphere

There are three primary methods for constructing the sphere.

  1.      Purchase a premade plastic sphere (two halves).
    1.      This can be expensive, and there is also the issue of assembling the sphere.
  2.      3D print various panels and then assemble them to form a sphere.
    1. Judging by what others have written, this process is not simple. The size and complexity of the panels make it difficult, and besides being expensive it is hard on the printer. A number of people report having to repair or replace their printers.
  3.      Construct a sphere from a material such as fiberglass.
    1. This started out as the method most people used, and an early DIY project made it seem much simpler than it really is. It involves covering a ball (beach or yoga) with a paper/canvas mache mixture. The BB-8 community decided that the body is about 20 cm in diameter, and the ball in the DIY project is not that big. As it turns out, finding a beach ball in Maine in January is impossible, so it was off to Amazon.


All three balls are listed as 20 cm. Hmmmm…

First attempt with paper and canvas, following the DIY project.


Clearly this was not going to work. I decided to use fiberglass instead of canvas. I also found a 20 cm ball at a party store.


The Head


The Drive Train (part 1)


June: PortCon, Portland, Maine

Despite the drive train issues, BB-8 still spoke and the light worked. So it was off to the con.



It’s back to the drawing board with the drive system…

The new drive mechanism.

I started over with new motors, frame, and servos. So far it’s looking a lot better.



Two videos before this all gets put back in the ball for a test run…



Posted in Uncategorized

Graph of a musical group’s albums, songs and lyrics

The Idea

Being the dad of a teenage daughter means I listen to a lot of current music: Lady Gaga, Taylor Swift, and recently it’s all about One Direction. As “” recently said, “One Direction owns the internet in 2015.” Sometimes I hear “this is a sad song” or “this is a happy one.” What could I learn about their music using Neo4j? Could one derive any sort of sentiment from the lyrics? Could I get my daughter interested in this? Only one way to find out…

How to start

The first step was to learn more about the group. There are currently four members, but for most of their albums there were five: Harry Styles, Niall, Liam, Zayn, and Louis. They have released five albums: Four, Take Me Home, Up All Night, Midnight Memories, and Made in the A.M. With my daughter’s help we found a site that had the lyrics to all of the songs. While some of the song files contained information about who was singing which section, many did not. I was hoping that the sentiment analysis could be aided by knowing the singer; maybe Harry always sings sad/break-up songs (he did date Taylor Swift). Since this information isn’t consistent, I couldn’t count on it.

Song sentiment?

I felt it was important to be able to track lyrics by location in the song: row and column. This way one could ask questions like “what words appear most often at the start (0,0) of a song?” or “how often do certain word combinations (“I” and “you”) appear on the same line?” This last question could be useful in better understanding sentiment.


Tools: Python, py2neo, R and RNeo4j.

The Model

The first step was to organize the songs into files by album. Once this was done it was simple to get Python to read in a list of albums, song titles, and lyrics (words). The graph…

I decided that a Group node would refer to a band or singer. A group is made up of members, and members are artists. For bands this is fine, and I made the choice to treat solo acts the same way. So Lady Gaga or Taylor Swift would be considered a group, member, and artist.


Nodes:

  • Group
  • Member
  • Artist
  • Album
  • Song
  • Lyrics


Relationships:

  • Album BY Group
  • Lyric IN Song
  • Song ON Album
  • Member ISA_ARTIST Artist
  • Group HAS_MEMBER Member


For the gist I restricted the data to one song per album and reduced the lyrics by two-thirds. Even with this there are still 581 lyric nodes and 232 unique words. The difference is due to words being repeated in different locations. The word “you” is found 28 times in the five songs.

Query 1

0 rows
5641 ms
No data returned.
Nodes created: 602
Relationships created: 609
Properties set: 1774

Find all songs where the word “my” appears

Query 2

MATCH (l:Lyric {name: "my"})-[r0:IN]-(s:Song) RETURN, l.row, l.column

Show distinct lyrics in the song “If I Could Fly”

The query returns 56 entries.

Query 4

MATCH (l:Lyric)-[r0:IN]-(n:Song) WHERE =~ "(?i)said" RETURN n, l

Show all lyrics in Act My Age.

Show all artists and members for the group

Show all songs on all of the albums. For the gist there is only one song per album.

Show all albums and members for the group

Show all of the lyrics for the song “Kiss You”. There are some connections of lyrics to other songs; this is because those lyrics are used in the same location. The lyric “Baby” is used in “Kiss You” and “What Makes You Beautiful” in the same row and column.

A query to find songs where the words “I” and “you” are on the same line. The query works well from Python, since I can filter out return values of 0. This type of search will be helpful when looking for phrases, i.e., words on the same line.

Query 5
MATCH (l1:Lyric{name: 'I'}) --(s:Song)
MATCH (l2:Lyric{name :'you'}) --(s:Song)
RETURN CASE WHEN l1.row = l2.row THEN [l1,l2,s] ELSE 0 END
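If the query is run from Python, the 0 placeholders can be dropped after the fact. A minimal sketch over an already-fetched result list (the rows here are made up for illustration):

```python
def same_line_hits(results):
    """Keep only rows where the CASE expression matched,
    i.e. drop the 0 placeholders the query returns."""
    return [row for row in results if row != 0]

# Hypothetical result list: two matches and three misses.
raw = [0, ["I", "you", "Act My Age"], 0, 0, ["I", "you", "If I Could Fly"]]
hits = same_line_hits(raw)
```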


Song Act My Age

Actual line, row 3: “I can count on you after all that we’ve been through”

If I Could Fly

Actual line, row 5: “I hope that you listen ’cause I let my guard down”

Sentiment and R

Below is a bar chart of the top ten most common lyrics. “I” and “you” are popular.

The last thing to consider is sentiment. Using the simple approach of matching positive and negative words, I’d like to see if one can make a determination of sentiment. There isn’t a song-specific word list that I could find, so I elected to use the AFINN list. Following examples from Jeffrey Breen and Andy Bromberg I was able to get some results. I didn’t divide the songs into training and test sets; instead I picked two songs and processed them. My daughter suggested that “Best Song Ever” would be happy and “If I Could Fly” would be sad.

The process starts with a query:

graph = startGraph("http://localhost:7474/db/data/")
query = "MATCH (l:Lyric)-[r0:IN]-(n:Song {name:'best song ever'}) RETURN"

ta = cypher(graph, query)

This returned a list of lyrics. Next I counted the number of lyrics that matched a positive or negative word in the AFINN list. I classified the words into “reg” (scale 1-3) and “very” (scale 4-5) for both positive and negative.
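The counting step can also be sketched in Python (the post does it in R); the AFINN scores used here are illustrative stand-ins, not the real list:

```python
def sentiment_counts(lyrics, afinn):
    """Bucket matched words: 'reg' for |score| 1-3, 'very' for 4-5,
    split by positive/negative sign. afinn maps word -> score."""
    counts = {"pos_reg": 0, "pos_very": 0, "neg_reg": 0, "neg_very": 0}
    for word in lyrics:
        score = afinn.get(word.lower())
        if score is None:
            continue  # word not in the sentiment list
        sign = "pos" if score > 0 else "neg"
        size = "very" if abs(score) >= 4 else "reg"
        counts[sign + "_" + size] += 1
    return counts
```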

Using the R functions naiveBayes() and predict(), the method is very simple, but the results do suggest that “Best Song Ever” is “happier” than “If I Could Fly”. It would be good to get One Direction’s opinion on this.

“Best Song Ever”
          reg  very
positive   10     3
negative    3     0

“If I Could Fly”
          reg  very
positive    1     0
negative    4     0

One thing I noticed is that simple word matching isn’t sufficient. For movie reviews or emails this may work, but songs are more complex.

Example: a happy song might have the line “I love you” while a sad song might have “I used to love you”. Both contain the positive word “love”, but the second line could be viewed as sad: love lost. This is where querying lyrics on the same line could help. It’s more complex than matching positive and negative words.

Conclusion

This was fun, and I got a little father-daughter time in as well. I’d like to pursue this to see what can be done by considering phrases and connected words.

Next up: Lady Gaga

Posted in Uncategorized

Like a lot of people I grew up with video games, but those were quite different from what we have today. Space Invaders, Lunar Lander, Missile Command, and Asteroids look like cave drawings compared to what is available now. I have experimented with tools like LightWave and Maya, but their costs are prohibitive and they are not really suited for amateur game developers. Unity 3D, on the other hand, is ideally suited for those just getting started with game development, and it can easily support more complex professional games. Its recent announcement of free support for mobile applications means it’s time for me to make the leap.

A modern game typically requires a lot of people, mainly artists, to create scenes and characters. I can use tools such as Blender, but I am not nearly proficient enough to build the images as well as create the game. I need a game where I can leverage existing artwork and just focus on the mechanics of the game and learning Unity.

What I need is a 2D side-scrolling space game. I decided to try to replicate the Lunar Lander game.


It won’t be an exact match, but instead something more updated that fits with the Unity model. Look around the Apple and Google app stores and you can find a number of these games; some are 2D while others are 3D and much more realistic. I am not trying to be the next “Flappy Bird”, so I don’t expect to compete with other games. It’s all about the learning.

Unity 3D

A lot can be done with Unity right out of the box, but anything that requires reacting to the user (player) is going to require custom code. There are two choices for this: C# and JavaScript. A lot of the tutorials and examples are in JavaScript, so I’ll stick with that.

The Game

The point of the game is to land the ship on the surface before you run out of fuel and crash. In the earlier games the ship would rotate as well as translate; correcting the rotation makes the game much more difficult to play, so for this version I’ll stick with simple translation: left, right, up, and down. Of course there needs to be a surface to land on. A simple flat surface is boring, so adding some sort of obstacles will make it a bit more challenging.

Things to consider:

  • The ship
  • Obstacles
  • Landing
  • Movement
  • Gravity
  • Fuel
  • Crashing
  • Player controls
  • Scoring
  • Sound

The Ship

Unity can import models from many tools, such as Blender and 3ds Max. For a mobile game the model cannot be too complex: the more detailed the model, the poorer the game performance will be. I found a reasonably sized lunar lander model from NASA that is free to use.



In the original game the surface changed from flat to mountains. I decided to add rocks to a flat surface, and to make things a bit more complex I added the rocks at random locations and sizes.



The rocks provide obstacles to avoid, but there needs to be a ‘safe’ landing place. These are marked green so the player can see them. Since the rocks are randomly placed, the landing pads need to be adjusted as well. The process is to place a landing spot and then place the rocks; the code has to make sure the rocks are not covering the landing place and that there is enough room for the lander.

Startup code to build the scene:

Declare the rocks and landing pads

var rocks: Transform[];
var landingPads: Transform[];

Find the game object tagged GUI so that we can determine the player’s level. The landing pads are adjusted differently once the player is beyond level one.

Create the landing pads by varying the “x” value.

GUI = GameObject.FindGameObjectWithTag("GUI").GetComponent(InGameGUI);
if (GUI.playerLevel > 1)
    startx = (GUI.playerLevel * 1.1) * 4895.0;
else
    startx = 4895.0;
currentXoffset = startx + 1200 * Random.Range(3, 10);
for (i = 1; i < numberOfLandingPads; i++) {
    lp = Instantiate(landingPads[0], Vector3(currentXoffset, -69.0, 514.6719), Quaternion.identity);
    lp.transform.localScale.x = 160;
    lp.transform.localScale.y = 1.1;
    lp.transform.localScale.z = 160;
    lp_locations[lp_locations_index, 0] = currentXoffset;
    lp_locations[lp_locations_index, 1] = (lp.transform.localScale.x * 5);
    lp_locations_index++;   // advance the index for the next pad
    // the generic angle brackets were likely stripped by the blog's HTML
    currentXoffset += (lp.GetComponent.<Renderer>().bounds.size.x * Random.Range(3, 6));
}

Create 1000 rocks. Each rock is generated at a random x location, and the height of each rock (the y direction) is also random. The game is 2D, but I am using Unity in 3D mode, so for the rocks I am creating a 3D field; at some point I may change the game to be more 3D. Each rock is checked to make sure it doesn’t overlap with a landing pad. I didn’t want the code to get stuck in the overlap check, so after 10 tries I give up.

for (var x = 0; x < 1000; x++) {
    var breakOut = 0;
    do {
        var index = Random.Range(0, 4);
        var locX = Random.Range(-50000, 50000);
        var locZ = Random.Range(-3000, 2000);
        var scaleX = 200; //Random.Range(Random.Range(5,50), Random.Range(150,200));
        var scaleY = Random.Range(Random.Range(5, 50), Random.Range(70, 500));
        if (GUI.playerLevel > 2)
            scaleY = Random.Range(Random.Range(5, 50), Random.Range(70, GUI.playerLevel * 500));
        var scaleZ = 400; //Random.Range(Random.Range(5,50), Random.Range(50,100));
        breakOut++;
        if (breakOut > 10)
            break;   // give up on finding a clear spot after 10 tries
    } while (checkOverlap(locX));
    rock = Instantiate(rocks[index], Vector3(locX, 0, locZ), Quaternion.identity);
    rock.transform.localScale.x = scaleX;
    rock.transform.localScale.y = scaleY;
    rock.transform.localScale.z = scaleZ;
    rock.tag = "rock";
}

A lot of values are hardcoded simply for expedience. Good software practice would be to use variables or constants.


Since the game has more than one or two controls, it requires buttons. Keyboard controls are not an option, and multi-touch is complicated. I need to control the main engine (up), the left and right thrusters, and a pause button.

A ParticleEmitter is used to indicate engine or thruster action.

var engineThruster : ParticleEmitter;
var LeftThruster : ParticleEmitter;
var RightThruster : ParticleEmitter;

An audio file is played when the engine is on. While the engine button is pressed, the emitter is set to true.

// if the Emitter is not running then fire it
// and play the sound
// then move ship up
if (engineThruster.emit == false)
    engineThruster.emit = true;

function moveShip_up() {
    var dir : Vector3;
    // if we are out of fuel then do not move the ship
    if (fuelMeterCurentValue == 0)
        return;
    // update the fuel status
    // get the local pos
    pos = Camera.main.WorldToScreenPoint(transform.position);
    // if the ship is higher than the screen, set the velocity to 0
    if (pos.y >= Screen.height) {
        rigidbody.velocity = Vector3.zero;
        return;
    }
    // yMovement is either 1 or 0 depending on the button pushed;
    // it limits movement to X or Y movement only.
    // Adjust the upward velocity the further away from the ground.
    // The value '200' should be replaced with a ratio of the screen height.
    if ((pos.y < ceiling) && (pos.y > Screen.height - 200))
        dir = Vector3(0, yMovement * upwardThrust / 2.0, 0);
    else if ((pos.y < Screen.height - 200) && (pos.y > Screen.height / 2))
        dir = Vector3(0, yMovement * upwardThrust / 1.5, 0);
    else
        dir = Vector3(0, yMovement * upwardThrust, 0);
    // add force to the ship
    rigidbody.AddForce(dir);
}


The assumption is that the planet has gravity. I have left the gravity setting at Unity’s default.


Fuel usage is adjusted whenever the engine is running. In Unity’s FixedUpdate() function the fuel is adjusted:

 fuelMeterCurentValue -= fuelLossRate * Time.deltaTime;

Multiplying by Time.deltaTime scales the fuel usage to the time elapsed between updates. It is standard in Unity to do this when changing a value inside the fixed update call.


There are two ways to fail a landing: land on rocks, or land too fast. A vertical velocity indicator turns red when the ship is descending too fast. When the ship touches the landing pad, the velocity is checked. The function OnCollisionEnter() is called when two objects touch; in this case it will be the ship and either a landing pad or a rock. Setting Time.timeScale to zero stops the game play, and GUI.guiMode is set to either win or lose. This causes the correct screen to be displayed and the score to be adjusted.

if (theCollision.gameObject.tag == "landingpad") {
    if (theCollision.relativeVelocity.magnitude > 50.0) {
        GUI.guiMode = "Lose";
        Time.timeScale = 0;
    } else {
        Time.timeScale = 0;
        GUI.guiMode = "Win";
    }
}


Since this is a mobile game there need to be buttons for the player. A single touch would work if all it had to do was fire the lander engine; left and right translations are harder. A touch to the left of the lander could move it left, and the same for the right, but since the lander moves, it could slide under the touch point and cause the movement to change. Buttons just seem easier.

Unfortunately Unity’s UI is not straightforward. The placement and operation of a button is pretty simple: buttons are GUITexture components. Getting the position and sizing correct for different-sized devices is a challenge. There is talk that future versions of Unity will have better UI tools.

In the FixedUpdate function I test each button.

for (var touch : Touch in Input.touches) {
    if (engineButton.HitTest(touch.position)) {
        // handle engine event
    }
    if (leftThrusterButton.HitTest(touch.position)) {
        // handle left thruster event
    }
}


Scoring is pretty straightforward: land successfully and you get a point and proceed to the next level; crash and you repeat the level. At each level the landing spots get harder to find. As the level increases I need to increase the fuel (or lower the rate at which it is used).


Sound is handled from an AudioSource component.


This plays the sound once. As long as the button is held down, the sound will be played over and over. Playing the sound in a loop is possible for something like background music, but for sounds like the engine or thrusters I need short bursts of sound.

Screen Shots

The ship approaching a landing pad. The vertical velocity is in white and positive, indicating that the ship is moving up at a rate within the range for landing.


Since the landing pads are randomly placed, I found it hard to locate them without running out of fuel. I added an overhead view in the upper right corner to guide the player toward a landing pad.


The left corner shows the fuel and velocity levels.


The ship over the rocks. The vertical velocity is in red and negative, indicating that the ship is descending too fast to land.


Google Play

I decided to put the game on Google Play just to see how the process works.

Update: I see one person has complained that at a high level you just crash into the rocks. It could be a fuel issue: the landing pads are too far away for the fuel usage rate.

Once I get the iOS version to work I’ll put it on the Apple Store as well.

Posted in Uncategorized

Back to the jobs data…

The task is to query the system each day and store the results. The goal is to gather sufficient data to run through a Hadoop/Spark pipeline and perform text analysis. Many of the postings, especially those from agencies, are duplicates; I want to see how well one can match job postings using Spark machine learning clustering.

Since the API will also return lat and lon, there is the opportunity to do some spatial analysis.

The API for the job site lets you filter by keyword, state, city, date, and employer or employer/agency. You can also limit the amount of data returned each time, and prefixing a keyword with ‘-’ will exclude listings that include that word. At this stage I want to ignore jobs from agencies and contract jobs, because many agencies post the same job and many ignore the location, i.e., post jobs in MA that are located in CA.

For the second part of this experiment I will change this to pick up all jobs and try to use classification to identify similar ones.

I define several lists:
1. The states to check
2. The language keywords to use
3. Skill set keywords

states = ['ME','MA','TN','WA','NC','AZ','CA']
languageset = ['java','ruby','php','c','c++','clojure','javascript','c#','.net']
skillset = ['architect','team lead']

The API expects a parameter dictionary to be passed in. The default dictionary is:

Besides using the “-” to ignore keywords I am setting “as_not” to two agencies that I know to ignore. “sr” and “st” are set to try and avoid contract jobs and agencies.

params = {
    'as_not' : "ATSG+Cybercoders",
    'l' : "ma",
    'limit' : "100000",
    'start' : "0",
    'userip' : "",
    'useragent' : "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2)"
}

Each day I run it (yes, I should set it up to run with cron).
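The per-request URL can be built from that dictionary with urllib; a sketch where the base URL is a placeholder, not the site’s real endpoint:

```python
import urllib.parse

def build_query_url(base_url, params, keyword, state):
    """Fill in the per-request fields and return the full GET URL.
    base_url is a placeholder; the real endpoint comes from the
    job site's API documentation."""
    q = dict(params)          # copy the defaults
    q["q"] = keyword          # e.g. "java -contract"
    q["l"] = state
    return base_url + "?" + urllib.parse.urlencode(q)
```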

I convert the data into a Result object.

class Result:
    name, company, location, jobtitle, city, state, country, source, date, latitude, longitude, jobkey, expired, sponsored, snippet, url, meta

It is probably overkill and I could simply skip this step and go right to the database, but I really like the compactness of the object and it makes the code look cleaner.

for each languageset
    for each state
        convert the data to a result object
        get the url to the main posting
        get the page
        use BeautifulSoup to parse the html
        get the content section of the page
        store the result in Neo4j database

add to Neo4j
    use graph.merge_one() to create state, city and language nodes
    create new Job node(jobkey)
        jobkey comes from the API and has a unique constraint to avoid adding the same posting twice
        set Job properties (lat, lon, url, snippet.content, posted date, poll date)
    create relationships
        Relationship(job, "IN_STATE", state)
        Relationship(job, "IN_CITY", city)
        Relationship(job, "LANGUAGE", lang)
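The node-and-relationship step can also be expressed as one Cypher statement built in plain Python. This is an illustration (property names mirror the Result fields), not the py2neo code I actually ran:

```python
def job_cypher(job):
    """Build a single Cypher query that merges the lookup nodes,
    creates the Job node, and wires up the relationships.
    `job` is a dict with state, city, language, jobkey, and url keys."""
    stmts = [
        "MERGE (s:State {name: '%s'})" % job["state"],
        "MERGE (c:City {name: '%s'})" % job["city"],
        "MERGE (l:Language {name: '%s'})" % job["language"],
        "CREATE (j:Job {jobkey: '%s', url: '%s'})" % (job["jobkey"], job["url"]),
        "CREATE (j)-[:IN_STATE]->(s)",
        "CREATE (j)-[:IN_CITY]->(c)",
        "CREATE (j)-[:LANGUAGE]->(l)",
    ]
    # one query, so the s/c/l/j bindings stay in scope
    return "\n".join(stmts)
```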

That is the code. After a few false starts I have been able to get it running and gather about 16k listings.


Below are some of the query results. I used Cypher’s regex matching on part of the job title.

match (j:Job)--(state:State {name:'NH'}) where j.jobtitle =~ 'Java Developer.*' return j, state
match (j:Job)--(state:State {name:'NH'}) where j.jobtitle =~ 'Chief.*' return j, state

  • Java Developer in New Hampshire
  • Java Developer in Maine and New Hampshire
  • Chief Technology in New Hampshire
  • PHP Developer in all the states polled


Java Developer in New Hampshire and Maine


Java Developer in New Hampshire


Chief Technology New Hampshire


PHP Developer all States

One of the goals is to run Spark’s machine learning library against the data. As a first test I will count the words in the job titles. In order to determine whether the process is working, I counted the words in job titles for New Hampshire; now I have something to compare to after the Spark processing.
Below is a sample of the word counts for all jobs polled in New Hampshire.

word count
analyst 14
application 10
applications 7
architect 15
architecture 4
associate 3
automation 5
business 2
chief 2
cloud 4
commercial 3
communications 4
computer 2
consultant 4
database 4
designer 3
developer 59
development 15
devices 3
devops 3
diem 3
electronic 2
embedded 5
engineer 83
engineering 2
engineers 2
integration 4
java 31
junior 3
linux 3
management 4
manager 8
mobile 3
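The count above can be reproduced with a few lines of Python; a sketch using collections.Counter:

```python
from collections import Counter

def title_word_counts(titles):
    """Lower-case each job title, split on whitespace, and count words."""
    counts = Counter()
    for title in titles:
        counts.update(title.lower().split())
    return counts
```

This is the baseline the Spark word-count job will be compared against.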

I have Hadoop and Spark running. I need to get Mazerunner installed and run a few tests. Then the fun begins…

Posted in Uncategorized

Machine learning Scrum Sprint estimates

Another idea as part of my “A year with data” exploration.

Anyone who has worked in a Scrum/Agile environment understands the pain involved in task estimation. Some of the methods include shirt sizes (S, M, L), the Fibonacci sequence (1, 2, 3, 5, 8, …), powers of 2 (1, 2, 4, 8), and even poker. Then there is the process of dealing with disparate estimates: one person gives an estimate of 2 and another suggests it’s 13. After some discussion it’s agreed that the task is an 8. At the end of the sprint, maybe it turns out that the task really was a 5. It would be useful, and interesting, to determine how well people do in their estimation. Is person A always underestimating? Is person B mostly spot on?

This seems like a good candidate for machine learning, supervised learning to be more specific. I am not sure how many teams capture information from the estimation process, but they should.

Basic information such as:

  • The original estimates for each team member
  • The final agreed upon estimates
  • The actual size of the task once completed

The data might look like this:

Task TM1 TM2 TM3 TM4 TM5 TM6 TM7 TM8 TM9 Actual
1 1 8 1 13 5 8 2 5 13 8
2 3 8 5 8 8 5 3 1 8 5
3 2 5 5 5 5 2 1 8 1 3
4 8 5 6 3 1 2 2 13 5 5
5 3 5 5 8 8 8 8 13 13 13
6 1 3 5 1 1 1 1 2 5 2
7 1 3 5 1 1 5 8 5 3 2
8 5 3 5 3 2 1 1 3 2 1
9 8 8 6 5 8 8 13 3 5 5
10 2 5 5 8 8 8 8 8 8 13

The training data consists of ten tasks, the estimates from each of the nine team members, and the actual value the task turned out to be. I chose the Fibonacci sequence as the estimation method. Another piece of information that could be useful is the estimate the team agreed upon; that could be compared to the actual value as well. I decided not to do that, since it hides the interesting information in each team member’s estimate. By using each member’s input we could determine which ones are contributing more or which ones are further off in their estimates.
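Given a table like the one above, per-member bias is easy to compute directly. A small sketch (mean signed error, a measure I am adding here, separate from the gradient descent that follows):

```python
def estimation_bias(estimates, actuals):
    """Mean signed error per team member: positive means the member
    tends to overestimate, negative means underestimate.
    estimates[i][j] is member j's estimate for task i."""
    n_tasks = len(estimates)
    n_members = len(estimates[0])
    bias = []
    for j in range(n_members):
        diff = sum(estimates[i][j] - actuals[i] for i in range(n_tasks))
        bias.append(diff / n_tasks)
    return bias
```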

Gradient descent

I am not going to try to explain gradient descent, as there are others much better qualified to do that. I found the Stanford machine learning course to be the most useful. The downside is that the course used Octave and I want to use Python, so there is a bit of a learning curve in making the change. Hopefully I have this figured out.

The significant equations are below.

The cost function J(θ) represents how well theta can predict the outcome.

hθ(x(i)) = θT x(i)

Where xj(i) represents team member j’s estimate for task i,
x(i) represents the estimate (feature) vector of the i-th training example, and
θT is the transpose of the θ vector.

hθ(x(i)) is the predicted value.

The math looks like this.

J(θ) = (1/(2m)) Σi=1..m (hθ(x(i)) - y(i))²

θj := θj - (α/m) Σi=1..m (hθ(x(i)) - y(i)) xj(i)



For this I am using the following Python packages: pandas, numpy, and seaborn.


Note: in order to use seaborn’s residplot I had to install ‘patsy’ and ‘statsmodels’.

easy_install patsy
pip install statsmodels

Set up pandas to display correctly
pd.set_option('display.notebook_repr_html', False)

The first step is to read the data.

training_set = pd.read_csv('estimate data.txt')

Next, we need to separate the estimates from the actual values

 tm_estimates = training_set[['tm1','tm2','tm3','tm4','tm5','tm6','tm7','tm8','tm9']] 
        tm1  tm2  tm3  tm4  tm5  tm6  tm7  tm8  tm9
         1    8    1   13    5    8    2    5   13
         3    8    5    8    8    5    3    1    8
         2    5    5    5    5    2    1    8    1
         8    5    6    3    1    2    2   13    5
         3    5    5    8    8    8    8   13   13
         1    3    5    1    1    1    1    2    5
         1    3    5    1    1    5    8    5    3
         5    3    5    3    2    1    1    3    2
         8    8    6    5    8    8   13    3    5
         2    5    5    8    8    8    8    8    8

 actuals = training_set['est']

A distribution plot of the actuals


One thing to consider is normalization of the data. This is important when data values vary greatly. In this case the data is not all that different, but it’s worth the effort to add this step.


mean = tm_estimates.mean()
std = tm_estimates.std()

tm_estimates_norm = (tm_estimates - mean) / std
tm1 tm2 tm3 tm4 tm5 tm6 tm7 tm8 tm9
-0.147264 1.312268 0.143019 0.661622 1.031586 0.064851 -0.403064 -1.184304 0.405606
-0.515425 -0.145808 0.143019 -0.132324 0.093781 -0.907909 -0.877258 -1.184304 0.405606
1.693538 -0.145808 0.858116 -0.661622 -1.156627 -0.907909 -0.640161 0.441211 -1.264536
-0.883585 -1.117858 0.143019 -1.190919 -1.156627 -1.232162 -0.877258 1.602294 1.598564
-0.883585 -1.117858 0.143019 -1.190919 -1.156627 0.064851 0.782419 -0.952088 -0.310169
0.589057 -1.117858 0.143019 -0.661622 -0.844025 -1.232162 -0.877258 -0.255438 -0.787353
1.693538 1.312268 0.858116 -0.132324 1.031586 1.037610 1.967903 -0.719871 -1.025944
-0.515425 -0.145808 0.143019 0.661622 1.031586 1.037610 0.782419 0.441211 0.405606

To satisfy the equation we need to add an extra column for θ0. For that we add x0 and set all of its values to 1.

# the number of data points
m = len(tm_estimates_norm)
#add the x0 column and set all values to one.
tm_estimates_norm['x0'] = pd.Series(np.ones(m))

Next we define the learning rate alpha to be 0.15 and the number of iterations to be 150. These two values control how well the cost function converges.

    alpha = 0.15
    iterations = 150


Set the initial values of theta to zero, then convert the data into numpy arrays instead of Python structures.

    # Initialize theta values to zero
    thetas = np.zeros(len(tm_estimates_norm.columns))
    tm_estimates_norm = np.array(tm_estimates_norm)
    estimations = np.array(actuals)
    cost_history = []

Now do something!
First calculate the prediction: theta dotted with the estimates.
Next perform the J(θ) calculation.
Finally calculate and record the cost; this last step tells us whether the process is converging.

    for i in range(iterations):
        # Calculate the predicted values
        predicted =, thetas)

        # Update the theta values
        thetas -= (alpha / m) * - estimations, tm_estimates_norm)

        # Calculate cost
        sum_of_square_errors = np.square(predicted - estimations).sum()
        cost = sum_of_square_errors / (2 * m)

        # Append cost to history
        cost_history.append(cost)

I tried different combinations of alpha and iterations just to see how this works.

The first attempt is using alpha = 0.75

This next try uses alpha = 0.05 and iterations = 50

This last one represents alpha = 0.15 and iterations = 150.


actuals predictions difference
8 7.923283 -0.076717
5 5.461475 0.461475
3 3.481465 0.481465
5 4.404572 -0.595428
13 14.287301 1.287301
2 1.225380 -0.774620
2 2.737848 0.737848
1 1.467125 0.467125
5 5.020789 0.020789
13 10.990762 -2.009238

This graph shows the linear fit between the predicted and actual values


This graph shows the difference between the predicted and actual values


The data set is far too small to declare anything. In the cases where the actual value was high there is less data, and the error is greater. In order to get more data I’ll have to make it up. Having worked in development for (many) years, I know that people tend to follow a pattern when giving estimates. The type of task also influences estimates: a UI task may seem simple to someone familiar with UI development, while a server/backend person may find a UI task daunting. In deciding how to create sample data, I devised a scheme that gives each team member a strength in skills (UI, database, and server) plus an estimation rating that defines how they tend to rate tasks: low, high, mixed, or random. Once I get this working I’ll start over and see how the new data performs.
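A sketch of how the synthetic estimates might be generated, assuming a simplified rating model (‘low’, ‘high’, ‘random’); the fuller scheme with per-skill strengths would build on this:

```python
import random

FIB = [1, 2, 3, 5, 8, 13]

def synthetic_estimate(actual, rating, rng):
    """Generate one member's estimate from the actual task size.
    rating is 'low', 'high', or 'random' -- a simplified version of
    the per-member estimation rating described above."""
    idx = FIB.index(actual)
    if rating == "low":
        idx = max(0, idx - rng.choice([0, 1]))       # sometimes one step low
    elif rating == "high":
        idx = min(len(FIB) - 1, idx + rng.choice([0, 1]))  # sometimes one step high
    else:
        idx = rng.randrange(len(FIB))                # anything goes
    return FIB[idx]
```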


Until then…

Posted in data

A year with data: Day 2

Wow, two days in a row. Good for me.

Where to start… Probably with the data.

Project Tycho.
The project has gathered data (level 2) over a 126-year period (1888 to 2014). Divided into cases and deaths, it includes fifty diseases, fifty states, and 1284 cities. Access is via a web service; there are calls to get a list of all diseases, states, cities, cases, and deaths. Using Python, I pulled the various pieces and stored each in a file. Since the process takes a while, it was better to get the data once and then format it as needed. For each state/city I also obtained the lat/lon information. Finally I gathered all of the data into one file, where each record looks like the ones below:

Event St City Disease Year Week Count Lat Lon
Case, AK, KETCHIKAN, MEASLES, 1914, 24, 1, 55.34222219999999, -131.6461111
Death, AK, KETCHIKAN, MEASLES, 1914, 24, 1, 55.34222219999999, -131.6461111
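A record in that format splits cleanly on commas. A minimal parsing sketch (field names are my own labels):

```python
line = ("Case, AK, KETCHIKAN, MEASLES, 1914, 24, 1, "
        "55.34222219999999, -131.6461111")

# Split on commas and strip the uneven whitespace around each field.
event, state, city, disease, year, week, count, lat, lon = \
    [field.strip() for field in line.split(",")]

record = {
    "event": event, "state": state, "city": city, "disease": disease,
    "year": int(year), "week": int(week), "count": int(count),
    "lat": float(lat), "lon": float(lon),
}
print(record["city"], record["year"], record["count"])
```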

The process:
1. Get all diseases.
2. Get all States.
3. For each State get all cities.
4. For each State/City geocode the city.
5. For each Disease.
For each State and City get events.
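The five steps can be sketched as nested loops. The get_* helpers and geocode below are hypothetical stand-ins for the Tycho web-service calls and the geocoder; here they return canned data so the control flow is runnable:

```python
# Hypothetical stand-ins for the web-service and geocoding calls.
def get_diseases(key): return ["MEASLES"]
def get_states(key): return ["AK"]
def get_cities(key, state): return ["KETCHIKAN"]
def geocode(state, city): return (55.3422222, -131.6461111)
def get_events(key, state, city, disease):
    return [("Case", 1914, 24, 1)]

def gather(key):
    records = []
    diseases = get_diseases(key)                  # 1. all diseases
    for state in get_states(key):                 # 2. all states
        for city in get_cities(key, state):       # 3. cities per state
            lat, lon = geocode(state, city)       # 4. geocode each city
            for disease in diseases:              # 5. events per disease
                for ev, year, week, count in get_events(
                        key, state, city, disease):
                    records.append((ev, state, city, disease,
                                    year, week, count, lat, lon))
    return records

print(gather("my-key"))
```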

Some python code:

The code below gives examples of how to pull the data from the Tycho site. The key is assigned by them. I found some cases where there are '[' and ']' characters in the data. Since I couldn't determine what to do with these I simply skip them. I also check for commas and spaces, which make parsing difficult.

import urllib.request
import xml.etree.ElementTree as et

def get_disease(key):
    listOfdisease = []
    url = ''+key   # Tycho diseases URL plus the assigned API key
    response = urllib.request.urlopen(url)
    html = response.read()
    xml = et.fromstring(html.decode('utf8'))
    myfile = open("data/disease"+".data","w")
    for element in xml.findall("row"):
        type = element.find("disease")
        # skip entries containing characters we don't want: '[' ']' '/'
        if not "[" in type.text and not "]" in type.text and not "/" in type.text:
            type.text = type.text.replace(" ","_")
            type.text = type.text.replace(",","")
            print(type.text)
            myfile.write(type.text + "\n")
            listOfdisease.append(type.text)
    myfile.close()
    return listOfdisease

Find the state from the response. Each state is defined by the tag 'loc'.

    xml = et.fromstring(html.decode('utf8'))
    for element in xml.findall("row"):
        StateAbv = element.find("state")
        State = element.find("loc")

Finding cases or deaths is a bit more complicated. The field 'number' represents the number of events for that period.

    for element in xml.findall("row"):
        year = element.find("year")
        week = element.find("week")
        number = element.find("number")
        if int(number.text) > 0:
            case = Case(disease, year.text, week.text, number.text, state)
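Once the weekly events are extracted they can be rolled up, for example into yearly totals. A small sketch with illustrative (year, week, number) rows, not real Tycho data:

```python
from collections import defaultdict

# Illustrative (year, week, number) rows like those parsed from the XML.
rows = [("1914", "24", "1"), ("1914", "25", "3"), ("1915", "2", "2")]

# Roll weekly event counts up into totals per year.
totals = defaultdict(int)
for year, week, number in rows:
    if int(number) > 0:          # same guard as the snippet above
        totals[year] += int(number)

print(dict(totals))   # expect {'1914': 4, '1915': 2}
```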

(Moving to GitHub soon).

The hardware. For the most part I just use a Windows laptop. I need to run Hadoop and Spark, and since I use the laptop for work I need a different solution, something I can run without disruption. Hadoop is marketed as running on commodity hardware. Let's see. I have two old systems on which I have installed Linux (Ubuntu), along with Hadoop and a host of other support software. I need to get these set up as a cluster at some point.

Posted in Uncategorized

A year with data

I have been trying to work on data analysis for a while, but it's been a lot of start and stop. I started with pure spatial data (University of Maine, Spatial Information) and then started working with public health data. Eventually I came to understand that the two are connected: considering where events occurred can be helpful in understanding how to handle public health issues. Some guy named Snow figured this out in the 1850s. The big data movement has made things like machine learning, R, NLP, Hadoop, Pandas, and Spark popular. I have decided to spend the next year mucking about with a couple of data sets to get a better idea of what can and can't be done.

The data.
There are two data sets I plan to use (so far).
The first is public health information assembled by Project Tycho at the University of Pittsburgh. The project has gathered public health data for over a hundred years. It consists of events (cases or deaths) due to disease. Each event is associated with a state, city, year, and day of the year. I have added lat and lon for each city.

The second set is being created each day (when I remember…). It involves pulling data from a job board using their API. This data is nice because it changes every day. It also has a lot of free text that might be useful for NLP or classification.

My day job involves Java. For this effort I'd like to stay with Python. There are some exceptions where Java might make more sense, such as loading large data sets or Hadoop MapReduce. I am using Django to create web apps as needed.
Python has its own analysis libraries, but works well with R. Probably a good path to stay with.

My favorite data store is Neo4j, the graph database.

Posted in Uncategorized