Tuesday, July 3, 2012

TURN IT UP OR TURN IT DOWN?

You are approaching the end of a project. If it is a long project, the next six months or less may be all that stands between you and release. If it is a short project, perhaps you are just waking up to the Monday before being done. Either way, most testers awaken at this moment to a creeping anxiety. A fear works its way into every last crevice of your conscious and unconscious awareness like worms in a corpse. Evoking worse dread and horror than even Edgar Allan Poe could spill from his pen, it screams at you like the beating of a telltale heart.

“What if I missed something?”

Any tester who doesn’t know that anxiety, that fear, isn’t really a tester. They are a developer, hiding out as a tester, waiting for a friend in some rogue game project to tap them on the shoulder. Or maybe they are a designer, tolerating the hours of button clicking and script hacking between coffee breaks with friends to debate the subtle importance of typography and layout on the electric aesthetic. Whatever they are, if that fear isn’t battering their skull until their conscience drips blood, they aren’t a tester.

End game anxiety is an obvious, inevitable byproduct of being the last one to deliver in the relay race that is software development. After the last bug fix, the tester still goes in, looks, and after shrugging their shoulders at finding nothing worthy of alarm, begrudgingly gives the “all clear” sign and watches with suppressed dread as the software goes to market, into production or is delivered to the customer. From that point on, the tester knows that if anything goes wrong, everybody is going to wonder why it wasn’t spotted in that last look under the hood.

How should a tester, or better yet, a whole team of testers, respond in the end game? We all feel that dread, but how should we react to it? I have seen two approaches, one very popular, and one I don’t see that often. For simplicity’s sake, I refer to them as “Turn it up” and “Turn it down”.

Turn it Up

Turn it Up is the reaction most test teams, leads and managers seem to prefer. The approach is to give full voice to the “What if I missed something?” anxiety and let it dictate the actions from this point in the schedule forward. This usually comes with the following:
-          Increased reliance on, and faith in, short burst, high energy testing activities like bug bashes
-          Recommendations to the team to go wild, try everything
-          Calls for longer hours, later nights
-          Immediate, high intensity reactions to every issue found
-          Constant re-questioning of the engineering prior to this point – every issue discovered is assumed to be proof that an entire large component of previous testing was inadequate

This is not necessarily a bad way to go. There is a “last chance to get it right” principle at play here that is true in the most literal way possible. Backward looking skepticism is also healthy. In a long project, the tradeoffs, compromises, and barrage of “this is how it works” decisions along the way can blind a test team into contentedly letting even the most egregious flaws slide merrily into market. Indiscriminate re-questioning as the default stance helps shake apart this sort of bug blindness.

There are also people who feed off this energy. High degrees of energy cannot sustain for very long periods of time, so the team is going to go up and down, from one lull to the next. Death march schedules may be burning the people out, but they really do not generate output that matches the hours. So the energy bursts have to be chosen. A release target is concrete. People can point to its location on the calendar. They can measure time against it. This makes the end game a convenient anchor around which to rally. People feed off other people’s energy. Thus, the end game push is sometimes a way to get the team to come out of whatever malaise they are undoubtedly experiencing and pull together for the final push.

On the negative side, the extra energy creates a great deal of chaos. That chaos sometimes pushes people to fix a bug that really ought to be postponed. The “re-question everything” approach tends to create re-discovery of issues that were already known and triaged away, but perhaps seen by new eyes that do not understand the risks and reasons the triage happened. The extra push and energy sometimes causes people to stop making decisions: instead of focusing on what is really important, they choose everything. Acting on everything burns time and resources that ought to be spent acting on less. The extra data coming in from the extra investigation clogs the triage and communication pipelines, making decisions take longer, cost more, and happen with less information.

It is all a balance. You really cannot get a benefit without accepting the negatives that come with it.

Turn it Down

Turn it Down is a reaction I see used infrequently. This approach is to answer the “What if I missed something?” anxiety by admitting, “Of course we missed something” and then deciding which of those things you really care about. This usually comes with the following:

-          Decrease the energy to reduce the chaos, such that things are moving fast, but in a more predictable and managed way
-          Start shutting things down. Call most of the project “Done”
-          Channel energy, time and resources into those activities which are most critical to a successful release
-          Do everything with a deliberate and known plan
-          Reserve enough slack to react when alarms go off, but do not deploy everything in alarm mode until it happens

This also is a pretty good way to go. It requires having a plan for the end game going in that has all the priorities and risks clearly articulated, but from that plan it helps produce a predictable route to release. You know that the most important things are being looked at because you have shut down everything else. You don’t have to worry so much about burning people out because you are not necessarily asking for yet another long haul.

Just like there are people who feed off the high energy of a “Turn it Up” approach, there are people who work much better under the more ordered, controlled energy of a “Turn it Down” approach. Some people work more confidently when they know they have things behind them, and when they feel a clear sense of their priorities. They do better knowing that they will not be asked to do a hundred things at once when only five are possible.

This approach also has its risks. Probably the most dangerous is if the decision to “leave things behind” is made in a way that is blind to the true state of a project. Stubborn determination to drop all cargo and hit a delivery date may just deliver an empty ship. Another risk is that people’s creativity will be stifled such that they stop looking for or reporting problems. The “Turn it Down” approach does not do as much to break tester blindness.

So, Which Should It Be?

I am going to refuse to cut down the middle and say “You have to have a little bit of both” because in this case I don’t believe it can happen. It is sort of an apples or oranges thing. You can have an apple, you can have an orange, but you cannot have an apple/orange (for the sake of argument, I am assuming we don’t know how to slice fruit yet). I personally believe that the style is pure personality, and the only personality that I really believe matters is that of the person driving the team. Typically that would be the test manager, but in some teams it may be a body of test leads, in others it might just be an overall project manager. Whoever it is, I don’t believe the “Turn it Up” vs. “Turn it Down” preferences tune to the middle, and whatever makes that person get up in the morning is going to dictate what will happen in the team.

As noted in “In Search of Excellence”, an excellent leader’s personality is echoed through the corporate or team culture. So, if the manager is effective, and they live for end game energy, then the team is going to do better on a “Turn it Up” approach. On the flipside, if an equally effective manager lives for landing the plane calmly and smoothly, then the team is going to do better on a “Turn it Down” approach.

I know my own approach, “Turn it Down”, works, as I have led teams through it. But I have likewise seen what seems like 90% of software projects succeed very well with precisely the opposite end game approach. I like my approach better, and know people who strongly agree with me, but I also know how to read the look on the face of someone who really needs the end game energy. For the longest time, I was convinced they were just wrong, and would see it my way were they to live through a project done that way… but the fact is it just won’t work for them. They prepare differently throughout the entire project. They wake up and go to work with a different motivation. They respond to problems differently.

Saturday, October 17, 2009

The fun of working together.

I am currently focused on finding and eliminating periodic latency spikes in our product. Pages are rendering pretty fast in general, well within our goals, but we still are not meeting goal on overall speed. There are moments of slow performance where a page for 97% of the day will be very fast, and then will suddenly start taking 10-20 seconds to render for periods lasting several minutes. Speed in this case is a reliability problem.

Up until recently we had been making a lot of progress looking for single operations on the back end that crossed resource budget thresholds. For example, if a database query was hitting the disk too heavily we filed a bug. This worked well for a while, but as we got closer and closer to goal, and as we filed lots of these kinds of bugs, it became more difficult to choose which ones were causing the biggest problem.

I wrote a tool to discover the time periods when pages were rendering more slowly. I specifically wrote it to count and measure clusters - how many clusters of slowness in a day, how slow did it get, how many requests are in the cluster, how long did the cluster last, etc. I told others about it.
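For readers who want to try something similar, here is a minimal sketch in Python of how slow clusters might be counted and measured. It is not the tool I wrote, just an illustration of the idea. The input format (pairs of timestamp and render time in seconds), the names `SlowCluster` and `find_slow_clusters`, and the threshold values are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class SlowCluster:
    start: float          # timestamp of the first slow request in the cluster
    end: float            # timestamp of the last slow request in the cluster
    request_count: int    # how many slow requests fell into the cluster
    worst_seconds: float  # slowest render time seen in the cluster

    @property
    def duration(self) -> float:
        return self.end - self.start


def find_slow_clusters(
    requests: Iterable[Tuple[float, float]],
    slow_threshold: float = 10.0,  # assumed: render time that counts as "slow"
    max_gap: float = 60.0,         # assumed: gap that still joins two slow requests
    min_requests: int = 2,         # assumed: ignore isolated slow requests
) -> List[SlowCluster]:
    """Group slow requests that occur close together in time."""
    clusters: List[SlowCluster] = []
    current = None
    for timestamp, render_seconds in sorted(requests):
        if render_seconds < slow_threshold:
            continue  # fast requests never start or extend a cluster
        if current is not None and timestamp - current.end <= max_gap:
            # Close enough to the previous slow request: extend the cluster.
            current.end = timestamp
            current.request_count += 1
            current.worst_seconds = max(current.worst_seconds, render_seconds)
        else:
            # Too far from the previous slow request: close it, start a new one.
            if current is not None and current.request_count >= min_requests:
                clusters.append(current)
            current = SlowCluster(timestamp, timestamp, 1, render_seconds)
    if current is not None and current.request_count >= min_requests:
        clusters.append(current)
    return clusters
```

Calling `find_slow_clusters(log_rows)` on one day of request data gives a list whose length is the number of clusters for that day, and each entry answers how slow it got, how many requests were involved, and how long it lasted.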

We formed a team of three testers and two developers to look at these slow clusters on a daily basis. Each day is given to a member of the team and that person tries to discover the cause of the slow requests.

This is the part where working together is fun. We are inventing our diagnostic method as we go, and every day someone on the team comes up with a new cool way to find the cause of problems. Someone might say "There was a one hour period two days ago where all the requests were about 20 seconds long. The CPU wasn't heavy on either the web server or the database. The disk didn't look busy. What do I do next?" What happens then is this cool back and forth of ideas and creativity. People start writing new test tools, database queries and such to find the cause, or to show how many times the problem is causing slow requests.

One example - a tester reporting to me wrote a query that extracts every time a specific event occurs on the server. He then lines that event up with the slow performance periods my tool identifies. Doing that he was able to say "this event accounts for 75% of the slow performance time periods on these two days", which helped us decide whether or not to accept the code fix. Likewise, we had another fix whose effectiveness we wanted to evaluate before we put it into the product. We installed the fix on our private production server, and compared the frequency of the event before and after. Indeed the event frequency dropped, as did the correlated slow periods.
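The lining up can be sketched in a few lines of Python. This is an illustration under assumptions, not the query my tester wrote: slow periods are taken to be (start, end) pairs in seconds, events are plain timestamps, and `lead_seconds` is a hypothetical allowance for an event that fires shortly before the slowness begins.

```python
from bisect import bisect_left
from typing import Iterable, List, Tuple


def fraction_of_periods_with_event(
    slow_periods: List[Tuple[float, float]],
    event_times: Iterable[float],
    lead_seconds: float = 30.0,  # assumed: how early an event may precede a slow period
) -> float:
    """Return the share of slow periods that have at least one event nearby."""
    if not slow_periods:
        return 0.0
    events = sorted(event_times)
    matched = 0
    for start, end in slow_periods:
        # First event at or after the earliest time we are willing to accept.
        i = bisect_left(events, start - lead_seconds)
        if i < len(events) and events[i] <= end:
            matched += 1
    return matched / len(slow_periods)
```

Running this on the days before and after a fix gives two numbers to compare. Keep in mind that a high fraction shows the event and the slowness happen together, which is a reason to investigate, not proof that one causes the other.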

It is this rapid back and forth of creativity that makes working with people fun. A part of me wants to build the whole solution - have every idea, be the guy on everything. But another part of me really enjoys watching a group of people take an idea I had and make it even bigger and better.