Saturday, October 17, 2009

The fun of working together.

I am currently focused on finding and eliminating periodic latency spikes in our product. Pages are rendering pretty fast in general, well within our goals, but we are still not meeting our goal for overall speed. A page will be very fast for 97% of the day, and then will suddenly start taking 10-20 seconds to render for periods lasting several minutes. Speed in this case is a reliability problem.

Up until recently we had been making a lot of progress looking for single operations on the back end that crossed resource budget thresholds. For example, if a database query was hitting the disk too heavily we filed a bug. This worked well for a while, but as we got closer and closer to our goal, and as we filed lots of these kinds of bugs, it became more difficult to tell which ones were causing the biggest problem.

I wrote a tool to discover the time periods when pages were rendering more slowly. I specifically wrote it to count and measure clusters - how many clusters of slowness in a day, how slow did it get, how many requests are in the cluster, how long did the cluster last, etc. I told others about it.
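A tool like this could be sketched roughly as follows. This is not the actual tool, just a minimal illustration of the idea. It assumes each request is available as a (timestamp, render time) pair in seconds, and the names `find_slow_clusters`, `slow_threshold`, `max_gap` and `min_count`, along with their default values, are made up for the example.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Cluster:
    start: float   # timestamp of the first slow request, in seconds
    end: float     # timestamp of the last slow request, in seconds
    count: int     # number of slow requests in the cluster
    worst: float   # slowest render time seen in the cluster, in seconds

    @property
    def duration(self) -> float:
        return self.end - self.start


def find_slow_clusters(
    requests: Iterable[Tuple[float, float]],
    slow_threshold: float = 10.0,  # assumed: a request this slow counts as "slow"
    max_gap: float = 60.0,         # assumed: slow requests this close belong together
    min_count: int = 3,            # assumed: ignore isolated slow requests
) -> List[Cluster]:
    """Group slow requests that occur close together in time into clusters."""
    clusters: List[Cluster] = []
    current = None
    for ts, render_time in sorted(requests):
        if render_time < slow_threshold:
            continue  # fast requests do not start or extend a cluster
        if current is not None and ts - current.end <= max_gap:
            # Close enough to the previous slow request: extend the cluster.
            current.end = ts
            current.count += 1
            current.worst = max(current.worst, render_time)
        else:
            # Too far from the previous slow request: close it, start a new one.
            if current is not None and current.count >= min_count:
                clusters.append(current)
            current = Cluster(start=ts, end=ts, count=1, worst=render_time)
    if current is not None and current.count >= min_count:
        clusters.append(current)
    return clusters
```

From the returned list you can read off how many clusters there were in a day (`len(clusters)`), how slow it got (`worst`), how many requests were involved (`count`) and how long each cluster lasted (`duration`).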

We formed a team of three testers and two developers to look at these slow clusters on a daily basis. Each day is assigned to one member of the team, and that person tries to discover the cause of that day's slow requests.

This is the part where working together is fun. We are inventing our diagnostic method as we go, and every day someone on the team comes up with a cool new way to find the cause of problems. Someone might say "There was a one hour period two days ago where all the requests were about 20 seconds long. The CPU wasn't heavy on either the web server or the database. The disk didn't look busy. What do I do next?" What happens then is this cool back and forth of ideas and creativity. People start writing new test tools, database queries and such to find the cause, or to show how many times the problem is causing slow requests.

One example - a tester reporting to me wrote a query that extracts every time a specific event occurs on the server. He then lines that event up with the slow performance periods my tool identifies. Doing that he was able to say "this event accounts for 75% of the slow performance time periods on these two days", which helped us decide whether or not to accept the code fix. Likewise, we had another fix whose effectiveness we wanted to evaluate before we put it into the product. We installed the fix on our private production server, and compared the frequency of the event before and after. Indeed the event frequency dropped, as did the correlated slow periods.
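Lining events up with slow periods can be sketched along these lines. It is only an illustration under assumptions, not the tester's actual query: slow periods are taken to be (start, end) timestamp pairs in seconds, the function name `fraction_of_clusters_with_event` is hypothetical, and the `lead` window (how long before a slow period an event still counts as related) is an invented parameter.

```python
from bisect import bisect_left
from typing import Iterable, Sequence, Tuple


def fraction_of_clusters_with_event(
    clusters: Sequence[Tuple[float, float]],
    event_times: Iterable[float],
    lead: float = 30.0,  # assumed: an event up to 30 s before a cluster still counts
) -> float:
    """Return the share of slow periods that have at least one event
    between (start - lead) and end."""
    if not clusters:
        return 0.0
    events = sorted(event_times)
    hits = 0
    for start, end in clusters:
        # Index of the first event at or after the start of the window.
        i = bisect_left(events, start - lead)
        if i < len(events) and events[i] <= end:
            hits += 1
    return hits / len(clusters)
```

Running the same calculation on the data from before and after a fix gives a simple way to compare the two, which is the kind of comparison described above.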

It is this rapid back and forth of creativity that makes working with people fun. A part of me wants to build the whole solution - have every idea, be the guy on everything. But another part of me really enjoys watching a group of people take an idea I had and make it even bigger and better.
