Another idea as part of my “A year with data” exploration.
Anyone who has worked in a Scrum/Agile environment understands the pain involved with task estimation. Some of the methods include shirt sizes (S, M, L), the Fibonacci sequence (1, 2, 3, 5, 8, …), powers of 2 (1, 2, 4, 8) and even poker. Then there is the process of dealing with disparate estimates. One person gives an estimate of 2 and another suggests it's 13. After some discussion it's agreed that the task is an 8. At the end of the sprint maybe it turns out that the task really was a 5. It would be useful, and interesting, to determine how well people do in their estimation. Is person A always underestimating? Is person B mostly spot on?
This seems like a good candidate for Machine Learning, supervised learning to be more specific. I am not sure how many teams capture information from the estimation process but they should.
Basic information such as:
- The original estimates for each team member
- The final agreed upon estimates
- The actual size of the task once completed
The data might look like this:
Task | TM1 | TM2 | TM3 | TM4 | TM5 | TM6 | TM7 | TM8 | TM9 | Actual |
---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 8 | 1 | 13 | 5 | 8 | 2 | 5 | 13 | 8 |
2 | 3 | 8 | 5 | 8 | 8 | 5 | 3 | 1 | 8 | 5 |
3 | 2 | 5 | 5 | 5 | 5 | 2 | 1 | 8 | 1 | 3 |
4 | 8 | 5 | 6 | 3 | 1 | 2 | 2 | 13 | 5 | 5 |
5 | 3 | 5 | 5 | 8 | 8 | 8 | 8 | 13 | 13 | 13 |
6 | 1 | 3 | 5 | 1 | 1 | 1 | 1 | 2 | 5 | 2 |
7 | 1 | 3 | 5 | 1 | 1 | 5 | 8 | 5 | 3 | 2 |
8 | 5 | 3 | 5 | 3 | 2 | 1 | 1 | 3 | 2 | 1 |
9 | 8 | 8 | 6 | 5 | 8 | 8 | 13 | 3 | 5 | 5 |
10 | 2 | 5 | 5 | 8 | 8 | 8 | 8 | 8 | 8 | 13 |
The ‘training’ data consists of ten tasks, the estimates from each of the nine team members and the actual value the task turned out to be. I chose the Fibonacci sequence as the method for estimates. Another piece of information that could be useful is the estimate the team agreed upon. That could be compared to the actual value as well. I decided not to do that since it hides the interesting information in each team member’s estimate. By using each team member’s input we could determine which ones are contributing more or which ones are further off in their estimates.
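As a first look at who is further off, before any machine learning, the signed and absolute errors per team member can be computed directly from the table. This is only a sketch: the DataFrame is typed in by hand from the sample table above, and the names `errors` and `summary` are made up for this example.

```python
import pandas as pd

# Estimates and actuals copied from the sample table above.
data = pd.DataFrame(
    [
        [1, 8, 1, 13, 5, 8, 2, 5, 13, 8],
        [3, 8, 5, 8, 8, 5, 3, 1, 8, 5],
        [2, 5, 5, 5, 5, 2, 1, 8, 1, 3],
        [8, 5, 6, 3, 1, 2, 2, 13, 5, 5],
        [3, 5, 5, 8, 8, 8, 8, 13, 13, 13],
        [1, 3, 5, 1, 1, 1, 1, 2, 5, 2],
        [1, 3, 5, 1, 1, 5, 8, 5, 3, 2],
        [5, 3, 5, 3, 2, 1, 1, 3, 2, 1],
        [8, 8, 6, 5, 8, 8, 13, 3, 5, 5],
        [2, 5, 5, 8, 8, 8, 8, 8, 8, 13],
    ],
    columns=[f"TM{i}" for i in range(1, 10)] + ["Actual"],
)

estimates = data.drop(columns="Actual")
# Signed error per estimate: negative means the member underestimated.
errors = estimates.sub(data["Actual"], axis=0)

summary = pd.DataFrame(
    {
        "mean_error": errors.mean(),            # bias: under (-) or over (+)
        "mean_abs_error": errors.abs().mean(),  # how far off on average
    }
)
print(summary.round(2))
```

A consistently negative `mean_error` would point to someone who tends to underestimate, while a small `mean_abs_error` would point to someone who is mostly spot on. With only ten tasks neither number means much yet.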
Gradient descent
I am not going to try to explain gradient descent as there are others much better qualified to do that. I found the Stanford Machine Learning course to be the most useful. The downside is that the course used Octave and I want to use Python. There is a bit of a learning curve in making the change. Hopefully I have this figured out.
The significant equations are below.
The cost function J(θ) measures how far the predictions made with theta are from the actual values; the smaller it is, the better theta predicts the outcome.
Where x_j^(i) represents team member j’s estimate for task i.
x^(i) represents the estimate (feature) vector of the i-th task in the training set.
θ^T is the transpose of the theta vector.
h_θ(x^(i)) is the predicted value for task i.
The math looks like this.
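Written out in LaTeX, the standard linear regression forms of the hypothesis, the cost function and the gradient descent update are given here; they match the prediction, cost and theta lines in the Python loop later in this post. The symbols m (the number of tasks), n (the number of team members), y^(i) (the actual size of task i) and α (the learning rate) are assumed from the usual course notation.

```latex
h_\theta(x^{(i)}) = \theta^T x^{(i)} = \sum_{j=0}^{n} \theta_j x_j^{(i)}

J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2

\theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
```

Here x_0^(i) is fixed at 1 for every task, so θ_0 acts as the intercept.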
For this I am using the following Python packages, with the aliases that appear in the code:
numpy (as np)
pandas (as pd)
matplotlib.pyplot (as plt)
seaborn (as sns)
Note: In order to use seaborn residplot I had to install ‘patsy’ and ‘statsmodels’.
easy_install patsy
pip install statsmodels
Set up pandas to display correctly
pd.set_option('display.notebook_repr_html', False)
The first step is to read the data.
training_set = pd.read_csv('estimate data.txt')
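Next, we need to separate the estimates from the actual values.
tm_estimates = training_set[['tm1','tm2','tm3','tm4','tm5','tm6','tm7','tm8','tm9']]
print(tm_estimates)

tm1 | tm2 | tm3 | tm4 | tm5 | tm6 | tm7 | tm8 | tm9 |
---|---|---|---|---|---|---|---|---|
1 | 8 | 1 | 13 | 5 | 8 | 2 | 5 | 13 |
3 | 8 | 5 | 8 | 8 | 5 | 3 | 1 | 8 |
2 | 5 | 5 | 5 | 5 | 2 | 1 | 8 | 1 |
8 | 5 | 6 | 3 | 1 | 2 | 2 | 13 | 5 |
3 | 5 | 5 | 8 | 8 | 8 | 8 | 13 | 13 |
1 | 3 | 5 | 1 | 1 | 1 | 1 | 2 | 5 |
1 | 3 | 5 | 1 | 1 | 5 | 8 | 5 | 3 |
5 | 3 | 5 | 3 | 2 | 1 | 1 | 3 | 2 |
8 | 8 | 6 | 5 | 8 | 8 | 13 | 3 | 5 |
2 | 5 | 5 | 8 | 8 | 8 | 8 | 8 | 8 |

actuals = training_set['est']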
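# distplot is deprecated in recent seaborn releases; histplot with kde=True is the closest replacement
sns.histplot(actuals, kde=True)
plt.show()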
A distribution plot of the actuals
One thing to consider is normalization of the data. This is important when the ranges of the features vary greatly, because gradient descent converges more slowly when one feature dominates the others. In this case the values are not all that different but it’s worth the effort to add this step.
mean = tm_estimates.mean()
std = tm_estimates.std()
tm_estimates_norm = (tm_estimates - mean) / std
print(tm_estimates_norm)
tm1 | tm2 | tm3 | tm4 | tm5 | tm6 | tm7 | tm8 | tm9 |
---|---|---|---|---|---|---|---|---|
-0.147264 | 1.312268 | 0.143019 | 0.661622 | 1.031586 | 0.064851 | -0.403064 | -1.184304 | 0.405606 |
-0.515425 | -0.145808 | 0.143019 | -0.132324 | 0.093781 | -0.907909 | -0.877258 | -1.184304 | 0.405606 |
1.693538 | -0.145808 | 0.858116 | -0.661622 | -1.156627 | -0.907909 | -0.640161 | 0.441211 | -1.264536 |
-0.883585 | -1.117858 | 0.143019 | -1.190919 | -1.156627 | -1.232162 | -0.877258 | 1.602294 | 1.598564 |
-0.883585 | -1.117858 | 0.143019 | -1.190919 | -1.156627 | 0.064851 | 0.782419 | -0.952088 | -0.310169 |
0.589057 | -1.117858 | 0.143019 | -0.661622 | -0.844025 | -1.232162 | -0.877258 | -0.255438 | -0.787353 |
1.693538 | 1.312268 | 0.858116 | -0.132324 | 1.031586 | 1.037610 | 1.967903 | -0.719871 | -1.025944 |
-0.515425 | -0.145808 | 0.143019 | 0.661622 | 1.031586 | 1.037610 | 0.782419 | 0.441211 | 0.405606 |
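To see what the normalization step does, here is a small self-contained check on just the first two columns (tm1 and tm2, typed in from the sample table). It is only a sketch, but it shows that each normalized column ends up with mean 0 and standard deviation 1. One detail worth knowing is that pandas `std()` uses the sample standard deviation (ddof=1) by default.

```python
import pandas as pd

# tm1 and tm2 estimates for the ten tasks, copied from the sample table.
tm_estimates = pd.DataFrame(
    {
        "tm1": [1, 3, 2, 8, 3, 1, 1, 5, 8, 2],
        "tm2": [8, 8, 5, 5, 5, 3, 3, 3, 8, 5],
    }
)

mean = tm_estimates.mean()
std = tm_estimates.std()  # sample standard deviation (ddof=1)
tm_estimates_norm = (tm_estimates - mean) / std

# After normalization every column is centred on 0 with unit spread.
print(tm_estimates_norm.mean().round(6))
print(tm_estimates_norm.std().round(6))
```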
To satisfy the equation we need to add an extra column for theta0, the intercept term. For that we add x0 and set all of its values to 1.
# the number of data points
m = len(tm_estimates_norm)
# add the x0 column and set all values to one.
tm_estimates_norm['x0'] = pd.Series(np.ones(m))
Next we define the learning rate alpha to be 0.15. The number of iterations is 150. Setting these two values will control how well the cost function converges.
alpha = 0.15
iterations = 150
Set the initial values of theta to zero. Then convert the data into numpy arrays instead of pandas structures.
# Initialize theta values to zero
thetas = np.zeros(len(tm_estimates_norm.columns))
tm_estimates_norm = np.array(tm_estimates_norm)
estimations = np.array(actuals)
print(estimations)
cost_history = []
Now do something!
First calculate the prediction: the dot product of the estimates and theta.
Next update theta using the gradient of J(θ).
Calculate the cost and record it. Recording the cost tells us whether it is decreasing from one iteration to the next.
for i in range(iterations):
    # Calculate the predicted values
    predicted = np.dot(tm_estimates_norm, thetas)
    # Calculate the theta
    thetas -= (alpha / m) * np.dot((predicted - estimations), tm_estimates_norm)
    # Calculate cost
    sum_of_square_errors = np.square(predicted - estimations).sum()
    cost = sum_of_square_errors / (2 * m)
    # Append cost to history
    cost_history.append(cost)
I tried different combinations of alpha and iterations just to see how this works.
The first attempt is using alpha = 0.75
This next try uses alpha = 0.05 and iterations = 50
This last one uses alpha = 0.15 and iterations = 150
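For anyone who wants to repeat this kind of experiment, here is a minimal self-contained sketch. It wraps the same steps (normalize, add x0, iterate) in a function so different alpha and iteration values can be compared. The function name `gradient_descent` and the hand-typed arrays are assumptions for this example, and the data is taken from the sample table rather than read from the file.

```python
import numpy as np

# Rows are tasks, columns are tm1..tm9, copied from the sample table.
estimates = np.array(
    [
        [1, 8, 1, 13, 5, 8, 2, 5, 13],
        [3, 8, 5, 8, 8, 5, 3, 1, 8],
        [2, 5, 5, 5, 5, 2, 1, 8, 1],
        [8, 5, 6, 3, 1, 2, 2, 13, 5],
        [3, 5, 5, 8, 8, 8, 8, 13, 13],
        [1, 3, 5, 1, 1, 1, 1, 2, 5],
        [1, 3, 5, 1, 1, 5, 8, 5, 3],
        [5, 3, 5, 3, 2, 1, 1, 3, 2],
        [8, 8, 6, 5, 8, 8, 13, 3, 5],
        [2, 5, 5, 8, 8, 8, 8, 8, 8],
    ],
    dtype=float,
)
actuals = np.array([8, 5, 3, 5, 13, 2, 2, 1, 5, 13], dtype=float)


def gradient_descent(features, targets, alpha, iterations):
    """Batch gradient descent for linear regression.

    Returns the fitted thetas and the cost recorded at each iteration.
    """
    m = len(targets)
    # Normalize with the sample standard deviation (ddof=1), as pandas does.
    norm = (features - features.mean(axis=0)) / features.std(axis=0, ddof=1)
    x = np.column_stack([norm, np.ones(m)])  # x0 column for the intercept
    thetas = np.zeros(x.shape[1])
    cost_history = []
    for _ in range(iterations):
        errors = x @ thetas - targets
        cost_history.append(np.square(errors).sum() / (2 * m))
        thetas -= (alpha / m) * (errors @ x)
    return thetas, cost_history


for alpha, iterations in [(0.05, 50), (0.15, 150)]:
    _, history = gradient_descent(estimates, actuals, alpha, iterations)
    print(f"alpha={alpha}, iterations={iterations}: "
          f"first cost={history[0]:.3f}, last cost={history[-1]:.3f}")
```

Plotting each `history` list against the iteration number shows how quickly a given alpha converges. If the cost grows instead of shrinking, alpha is too large for this data.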
actuals | predictions | difference |
8 | 7.923283 | -0.076717 |
5 | 5.461475 | 0.461475 |
3 | 3.481465 | 0.481465 |
5 | 4.404572 | -0.595428 |
13 | 14.287301 | 1.287301 |
2 | 1.225380 | -0.774620 |
2 | 2.737848 | 0.737848 |
1 | 1.467125 | 0.467125 |
5 | 5.020789 | 0.020789 |
13 | 10.990762 | -2.009238 |
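The difference column is simply the prediction minus the actual. As a sketch, the table can be rebuilt from the numbers above and summarized with a single figure, the mean absolute error. The name `mae` is just for this example.

```python
import pandas as pd

# Actuals and predictions copied from the results table above.
results = pd.DataFrame(
    {
        "actuals": [8, 5, 3, 5, 13, 2, 2, 1, 5, 13],
        "predictions": [7.923283, 5.461475, 3.481465, 4.404572, 14.287301,
                        1.225380, 2.737848, 1.467125, 5.020789, 10.990762],
    }
)

# Positive means the model predicted too high, negative too low.
results["difference"] = results["predictions"] - results["actuals"]

# Average size of the miss, ignoring direction.
mae = results["difference"].abs().mean()

print(results)
print(f"mean absolute error: {mae:.4f}")
```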
This graph shows the linear fit between the predicted and actual values
This graph shows the difference between the predicted and actual values
The data set is far too small to declare anything. There is less data for the cases where the actual was high, and the error there is greater. In order to get more data I’ll have to make it up. Having worked in development for (many) years, I know that people tend to follow a pattern when giving estimates. The type of task will also dictate estimates: a UI task may seem simple to someone familiar with UI development, while a server/backend person may find a UI task daunting. In deciding how to create sample data I devised a scheme to give each team member a strength in each of three skills: UI, database, and server. Each member also has an estimation rating, which defines how they tend to rate tasks: low, high, mix or random. Once I get this working I will start over and see how this new data performs.
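A rough sketch of how such a scheme might look in Python is below. Everything in it is an assumption made for illustration, not the finished generator: the function names, the skill values between 0 and 1, the 30% bias for low and high raters, and the noise model are all placeholders.

```python
import random

FIBONACCI = [1, 2, 3, 5, 8, 13]
TASK_TYPES = ["ui", "database", "server"]


def nearest_fibonacci(value):
    """Snap a raw number to the closest allowed estimate."""
    return min(FIBONACCI, key=lambda size: abs(size - value))


def make_member(rng):
    """Hypothetical team member: a skill in [0, 1) per task type plus a rating."""
    return {
        "skills": {task_type: rng.random() for task_type in TASK_TYPES},
        "tendency": rng.choice(["low", "high", "mix", "random"]),
    }


def estimate(member, task_type, actual, rng):
    """Turn the true task size into this member's estimate."""
    if member["tendency"] == "random":
        return rng.choice(FIBONACCI)
    # Assumed bias: 'low' shaves 30% off, 'high' adds 30%, 'mix' does either.
    bias = {"low": -0.3, "high": 0.3, "mix": rng.choice([-0.3, 0.3])}[member["tendency"]]
    # Assumed noise: the weaker the skill for this task type, the wider the spread.
    noise = rng.gauss(0, (1.0 - member["skills"][task_type]) * 0.5)
    return nearest_fibonacci(actual * (1 + bias + noise))


def make_dataset(n_tasks, n_members, seed=0):
    """Return rows of [estimate per member..., actual], like the sample table."""
    rng = random.Random(seed)
    members = [make_member(rng) for _ in range(n_members)]
    rows = []
    for _ in range(n_tasks):
        task_type = rng.choice(TASK_TYPES)
        actual = rng.choice(FIBONACCI)
        rows.append([estimate(m, task_type, actual, rng) for m in members] + [actual])
    return rows


rows = make_dataset(n_tasks=200, n_members=9)
print(rows[:3])
```

Fixing the seed keeps the generated data repeatable, which makes it easier to compare runs with different alpha and iteration settings.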
Until then…