## Machine learning Scrum Sprint estimates

Another idea as part of my “A year with data” exploration.

Anyone who has worked in a Scrum/Agile environment understands the pain involved with task estimation. Some of the methods include shirt sizes (S, M, L), the Fibonacci sequence (1, 2, 3, 5, 8, …), powers of 2 (1, 2, 4, 8) and even poker. Then there is the process of dealing with disparate estimates. One person gives an estimate of 2 and another suggests it's a 13. After some discussion it's agreed that the task is an 8. At the end of the sprint maybe it turns out that the task really was a 5. It would be useful, and interesting, to determine how well people do in their estimation. Is person A always underestimating? Is person B mostly spot on?

This seems like a good candidate for machine learning, supervised learning to be more specific. I am not sure how many teams capture information from the estimation process, but they should.

Basic information such as:

• The original estimates for each team member
• The final agreed upon estimates
• The actual size of the task once completed

The data might look like this:

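| Task | TM1 | TM2 | TM3 | TM4 | TM5 | TM6 | TM7 | TM8 | TM9 | Actual |
|------|-----|-----|-----|-----|-----|-----|-----|-----|-----|--------|
| 1 | 1 | 8 | 1 | 13 | 5 | 8 | 2 | 5 | 13 | 8 |
| 2 | 3 | 8 | 5 | 8 | 8 | 5 | 3 | 1 | 8 | 5 |
| 3 | 2 | 5 | 5 | 5 | 5 | 2 | 1 | 8 | 1 | 3 |
| 4 | 8 | 5 | 6 | 3 | 1 | 2 | 2 | 13 | 5 | 5 |
| 5 | 3 | 5 | 5 | 8 | 8 | 8 | 8 | 13 | 13 | 13 |
| 6 | 1 | 3 | 5 | 1 | 1 | 1 | 1 | 2 | 5 | 2 |
| 7 | 1 | 3 | 5 | 1 | 1 | 5 | 8 | 5 | 3 | 2 |
| 8 | 5 | 3 | 5 | 3 | 2 | 1 | 1 | 3 | 2 | 1 |
| 9 | 8 | 8 | 6 | 5 | 8 | 8 | 13 | 3 | 5 | 5 |
| 10 | 2 | 5 | 5 | 8 | 8 | 8 | 8 | 8 | 8 | 13 |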

The ‘training’ data consists of ten tasks, the estimates from each of the nine team members and the actual value the task turned out to be. I chose the Fibonacci sequence as the method for estimates. Another piece of information that could be useful is the estimate the team agreed upon. That could be compared to the actual value as well. I decided not to do that since it hides the interesting information in each team member's estimate. By using each team member's input we could determine which ones are contributing more or which ones are further off in their estimates.
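Even before any machine learning, a quick look at who tends to over- or underestimate can be read straight from the table. The following is a minimal sketch, assuming the table has been typed into a pandas DataFrame; the column names `tm1` to `tm9` and `actual`, and the two summary columns, are illustrative choices rather than anything from the original data file.

```python
import pandas as pd

# The ten tasks from the table: nine estimates followed by the actual size.
rows = [
    [1, 8, 1, 13, 5, 8, 2, 5, 13, 8],
    [3, 8, 5, 8, 8, 5, 3, 1, 8, 5],
    [2, 5, 5, 5, 5, 2, 1, 8, 1, 3],
    [8, 5, 6, 3, 1, 2, 2, 13, 5, 5],
    [3, 5, 5, 8, 8, 8, 8, 13, 13, 13],
    [1, 3, 5, 1, 1, 1, 1, 2, 5, 2],
    [1, 3, 5, 1, 1, 5, 8, 5, 3, 2],
    [5, 3, 5, 3, 2, 1, 1, 3, 2, 1],
    [8, 8, 6, 5, 8, 8, 13, 3, 5, 5],
    [2, 5, 5, 8, 8, 8, 8, 8, 8, 13],
]
members = [f"tm{i}" for i in range(1, 10)]
data = pd.DataFrame(rows, columns=members + ["actual"])

# Signed error per estimate: negative means the member underestimated.
errors = data[members].sub(data["actual"], axis=0)

summary = pd.DataFrame({
    "mean_error": errors.mean(),            # average bias per member
    "mean_abs_error": errors.abs().mean(),  # average distance from the actual
})
print(summary.round(2))
```

A member with a strongly negative `mean_error` is a habitual underestimator, while a small `mean_abs_error` suggests someone who is usually close. With only ten tasks this is an indication at best.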

I am not going to try to explain gradient descent, as there are others much better qualified to do that. I found the Stanford Machine Learning course to be the most useful. The downside is that the course used Octave and I want to use Python. There is a bit of a learning curve in making the change. Hopefully I have this figured out.

The significant equations are below.

The cost function $J(\theta)$ represents how well theta can predict the outcome, where:

$x_j^{(i)}$ is team member $j$'s estimate for task $i$.
$x^{(i)}$ is the estimate (feature) vector for task $i$ of the training set.
$\theta^T$ is the transpose of the $\theta$ vector.

$h_\theta(x^{(i)})$ is the predicted value.
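For reference, the standard linear regression equations for gradient descent can be written out as follows. This is a reconstruction chosen to match the code later in the post (cost divided by $2m$, step scaled by $\alpha/m$); $y^{(i)}$ stands for the actual value of task $i$, $m$ for the number of tasks, $n$ for the number of team members and $\alpha$ for the learning rate.

```latex
% Hypothesis: the prediction for task i
h_\theta(x^{(i)}) = \theta^T x^{(i)} = \sum_{j=0}^{n} \theta_j x_j^{(i)}

% Cost function
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2

% Gradient descent update, applied to every \theta_j simultaneously
\theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
```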

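The math looks like this. For the code I am using the following Python packages (imported under their usual aliases `np`, `pd`, `plt` and `sns`, which the snippets below rely on):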

numpy
pandas
matplotlib.pyplot
seaborn

Note: In order to use seaborn's residplot I had to install ‘patsy’ and ‘statsmodels’:

easy_install patsy
pip install statsmodels

Set up pandas to display correctly:
pd.set_option('display.notebook_repr_html', False)

The first step is to read the data.

```
training_set = pd.read_csv('estimate data.txt')
```
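The file itself is not shown here. Assuming it is a comma-separated file with a header row, one column per team member (`tm1` to `tm9`) and the actual size in a column named `est` (the names the later code uses), it might look like the sketch below. Only the first three tasks are included, and the text is read from an in-memory string instead of `estimate data.txt`.

```python
import io

import pandas as pd

# Assumed layout of 'estimate data.txt' (first three tasks only).
csv_text = """tm1,tm2,tm3,tm4,tm5,tm6,tm7,tm8,tm9,est
1,8,1,13,5,8,2,5,13,8
3,8,5,8,8,5,3,1,8,5
2,5,5,5,5,2,1,8,1,3
"""

training_set = pd.read_csv(io.StringIO(csv_text))
print(training_set.shape)
print(training_set.columns.tolist())
```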

Next, we need to separate the estimates from the actual values

```
tm_estimates = training_set[['tm1', 'tm2', 'tm3', 'tm4', 'tm5', 'tm6', 'tm7', 'tm8', 'tm9']]
print(tm_estimates)
```

```
tm1  tm2  tm3  tm4  tm5  tm6  tm7  tm8  tm9
  1    8    1   13    5    8    2    5   13
  3    8    5    8    8    5    3    1    8
  2    5    5    5    5    2    1    8    1
  8    5    6    3    1    2    2   13    5
  3    5    5    8    8    8    8   13   13
  1    3    5    1    1    1    1    2    5
  1    3    5    1    1    5    8    5    3
  5    3    5    3    2    1    1    3    2
  8    8    6    5    8    8   13    3    5
  2    5    5    8    8    8    8    8    8
```

```
# The 'est' column holds the actual size of each task
actuals = training_set['est']
```

```
# Note: distplot is deprecated in newer seaborn releases;
# sns.histplot(actuals, kde=True) is the suggested replacement.
sns.distplot(actuals)
plt.show()
```

A distribution plot of the actuals

One thing to consider is the normalization of the data. This is important when data values vary greatly. In this case the data is not all that different, but it's worth the effort to add this step.

```
mean = tm_estimates.mean()
std = tm_estimates.std()

tm_estimates_norm = (tm_estimates - mean) / std
print(tm_estimates_norm)
```

```
      tm1       tm2       tm3       tm4       tm5       tm6       tm7       tm8       tm9
-0.147264  1.312268  0.143019  0.661622  1.031586  0.064851 -0.403064 -1.184304  0.405606
-0.515425 -0.145808  0.143019 -0.132324  0.093781 -0.907909 -0.877258 -1.184304  0.405606
 1.693538 -0.145808  0.858116 -0.661622 -1.156627 -0.907909 -0.640161  0.441211 -1.264536
-0.883585 -1.117858  0.143019 -1.190919 -1.156627 -1.232162 -0.877258  1.602294  1.598564
-0.883585 -1.117858  0.143019 -1.190919 -1.156627  0.064851  0.782419 -0.952088 -0.310169
 0.589057 -1.117858  0.143019 -0.661622 -0.844025 -1.232162 -0.877258 -0.255438 -0.787353
 1.693538  1.312268  0.858116 -0.132324  1.031586  1.037610  1.967903 -0.719871 -1.025944
-0.515425 -0.145808  0.143019  0.661622  1.031586  1.037610  0.782419  0.441211  0.405606
```
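One detail worth knowing about this step: pandas' `std()` uses the sample standard deviation (`ddof=1`) by default, while NumPy's `np.std` defaults to the population version (`ddof=0`). The small sketch below uses only the `tm1` column from the data table and checks that the pandas-style normalization gives a column with mean 0 and sample standard deviation 1.

```python
import numpy as np
import pandas as pd

# tm1 estimates for the ten tasks, copied from the data table.
tm1 = pd.Series([1, 3, 2, 8, 3, 1, 1, 5, 8, 2], dtype=float)

# pandas default: sample standard deviation (ddof=1).
tm1_norm = (tm1 - tm1.mean()) / tm1.std()

print(tm1_norm.round(6).tolist())
print("mean:", round(tm1_norm.mean(), 6))
print("std (ddof=1):", round(tm1_norm.std(), 6))

# NumPy's default (ddof=0) reports a slightly smaller spread for the same values.
population_std = float(np.std(tm1_norm.to_numpy()))
print("np.std default (ddof=0):", round(population_std, 6))
```

The difference only matters if the normalization is later redone with NumPy arrays: passing `ddof=1` to `np.std` keeps the two in agreement.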

To satisfy the equation we need to add an extra column for theta0. For that we add x0 and set all of its values to 1.

```
# the number of data points
m = len(tm_estimates_norm)
# add the x0 column and set all values to one
tm_estimates_norm['x0'] = pd.Series(np.ones(m))
```

Next we define the learning rate alpha to be 0.15. The number of iterations is 150. Setting these two values will control how well the cost function converges.

```
alpha = 0.15
iterations = 150
```

Set the initial values of theta to zero. Then convert the data into numpy arrays instead of pandas structures.

```
# Initialize theta values to zero
thetas = np.zeros(len(tm_estimates_norm.columns))

tm_estimates_norm = np.array(tm_estimates_norm)
estimations = np.array(actuals)
print(estimations)
cost_history = []
```

Now do something!
First calculate the prediction: theta · estimates.
Next perform the theta update, which is the gradient step on J(θ).
Then calculate the cost and record it. This last step will tell us if the cost is decreasing or not.

```
for i in range(iterations):
    # Calculate the predicted values
    predicted = np.dot(tm_estimates_norm, thetas)

    # Update theta with one gradient step
    thetas -= (alpha / m) * np.dot((predicted - estimations), tm_estimates_norm)

    # Calculate cost (for the theta values used in this iteration's prediction)
    sum_of_square_errors = np.square(predicted - estimations).sum()
    cost = sum_of_square_errors / (2 * m)

    # Append cost to history
    cost_history.append(cost)
```
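Putting the pieces together, the whole procedure can be sketched as one self-contained function. This is a minimal sketch under a few assumptions, not the exact notebook code: the function name `gradient_descent` is an illustrative choice, and the data is typed in from the table instead of being read from the file.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.15, iterations=150):
    """Batch gradient descent for linear regression.

    X is an (m, n) feature matrix that already includes the x0 column of ones;
    y is the length-m vector of actual values.
    """
    m = len(y)
    thetas = np.zeros(X.shape[1])
    cost_history = []
    for _ in range(iterations):
        errors = X @ thetas - y                                  # h_theta(x) - y
        cost_history.append(np.square(errors).sum() / (2 * m))   # J(theta)
        thetas -= (alpha / m) * (errors @ X)                     # update all thetas
    return thetas, cost_history

# Estimates (tm1..tm9) and actuals for the ten tasks in the table.
estimates = np.array([
    [1, 8, 1, 13, 5, 8, 2, 5, 13],
    [3, 8, 5, 8, 8, 5, 3, 1, 8],
    [2, 5, 5, 5, 5, 2, 1, 8, 1],
    [8, 5, 6, 3, 1, 2, 2, 13, 5],
    [3, 5, 5, 8, 8, 8, 8, 13, 13],
    [1, 3, 5, 1, 1, 1, 1, 2, 5],
    [1, 3, 5, 1, 1, 5, 8, 5, 3],
    [5, 3, 5, 3, 2, 1, 1, 3, 2],
    [8, 8, 6, 5, 8, 8, 13, 3, 5],
    [2, 5, 5, 8, 8, 8, 8, 8, 8],
], dtype=float)
actuals = np.array([8, 5, 3, 5, 13, 2, 2, 1, 5, 13], dtype=float)

# Normalize with the sample standard deviation, as pandas does by default.
X = (estimates - estimates.mean(axis=0)) / estimates.std(axis=0, ddof=1)
X = np.column_stack([X, np.ones(len(X))])  # x0 column, added last

thetas, cost_history = gradient_descent(X, actuals)
print("first cost:", cost_history[0])
print("last cost: ", cost_history[-1])
print("predictions:", np.round(X @ thetas, 3))
```

Other combinations of `alpha` and `iterations` can be tried by passing them as arguments. Plotting `cost_history` shows whether a given combination converges.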

I tried different combinations of alpha and iterations just to see how this works.

The first attempt is using alpha = 0.75.

This next try uses alpha = 0.05 and iterations = 50.

This last one represents alpha = 0.15 and iterations = 150.

| actuals | predictions | difference |
|---------|-------------|------------|
| 8 | 7.923283 | -0.076717 |
| 5 | 5.461475 | 0.461475 |
| 3 | 3.481465 | 0.481465 |
| 5 | 4.404572 | -0.595428 |
| 13 | 14.287301 | 1.287301 |
| 2 | 1.225380 | -0.774620 |
| 2 | 2.737848 | 0.737848 |
| 1 | 1.467125 | 0.467125 |
| 5 | 5.020789 | 0.020789 |
| 13 | 10.990762 | -2.009238 |
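To put a single number on the fit, the differences can be summarized. The sketch below uses the actual and predicted values reported above. The choice of mean absolute error and root mean squared error as summary metrics is an illustrative one, not part of the original analysis.

```python
import numpy as np

# Actual and predicted values as reported above.
actuals = np.array([8, 5, 3, 5, 13, 2, 2, 1, 5, 13], dtype=float)
predictions = np.array([
    7.923283, 5.461475, 3.481465, 4.404572, 14.287301,
    1.225380, 2.737848, 1.467125, 5.020789, 10.990762,
])

difference = predictions - actuals

mae = np.abs(difference).mean()                # mean absolute error
rmse = np.sqrt(np.square(difference).mean())   # root mean squared error
worst = int(np.argmax(np.abs(difference)))     # position of the largest miss

print(f"MAE:  {mae:.3f}")
print(f"RMSE: {rmse:.3f}")
print(f"Largest miss: task {worst + 1}, off by {difference[worst]:+.3f}")
```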

### This graph shows the linear fit between the predicted and actual values

### This graph shows the difference between the predicted and actual values

The data set is far too small to declare anything. In the cases where the actual was high there is less data and the error is greater. In order to get more data I’ll have to make it up. Having worked in development for (many) years, I know that people tend to follow a pattern when giving estimates. Also, the type of task will dictate estimates. A UI task may seem simple to someone familiar with UI development, while a server/backend person may find a UI task daunting. In deciding how to create sample data I devised a scheme to give each team member a strength in each of three skills: UI, database, and server. Also, each member has an estimation rating. This defines how they tend to rate tasks: low, high, mix or random. Once I get this working I will start over and see how this new data performs.
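One way such a scheme could be sketched is shown below. Everything in it is an assumption made for illustration: the `make_estimate` helper, the strengths on a 0 to 1 scale, the rule that a weak skill widens the error, and the way each rating shifts an estimate along the Fibonacci scale.

```python
import numpy as np

FIB = np.array([1, 2, 3, 5, 8, 13])
SKILLS = ["ui", "database", "server"]

def make_estimate(rng, true_idx, tendency, strength):
    """Return one member's estimate for a task (all rules are assumptions)."""
    # Hypothetical rule: a weaker skill in the task's area widens the error.
    spread = 1 if strength >= 0.5 else 2
    if tendency == "low":
        shift = -int(rng.integers(0, spread + 1))       # never above the actual
    elif tendency == "high":
        shift = int(rng.integers(0, spread + 1))        # never below the actual
    elif tendency == "mix":
        shift = int(rng.integers(-spread, spread + 1))  # either direction
    else:  # "random": ignores the actual size entirely
        return int(rng.choice(FIB))
    idx = int(np.clip(true_idx + shift, 0, len(FIB) - 1))
    return int(FIB[idx])

rng = np.random.default_rng(0)

# Two hypothetical team members: a rating plus a strength per skill.
team = {
    "tm1": {"tendency": "low", "ui": 0.9, "database": 0.3, "server": 0.4},
    "tm2": {"tendency": "high", "ui": 0.2, "database": 0.8, "server": 0.7},
}

for task in range(5):
    skill = str(rng.choice(SKILLS))            # what kind of task this is
    true_idx = int(rng.integers(0, len(FIB)))  # index of the actual size
    estimates = {
        name: make_estimate(rng, true_idx, member["tendency"], member[skill])
        for name, member in team.items()
    }
    print(task + 1, skill, int(FIB[true_idx]), estimates)
```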

Until then… 