Wednesday, March 13, 2013

Novopay - the tale of a compelling event in the New Zealand IT industry

Open almost any book on testing, and it will start with a "cautionary tale" about software testing, where something was released into production, and it went wrong with devastating results.  They are the ghoulish "tales around the campfire" of our IT industry.

One problem I have found within the New Zealand industry is that as testers we love to focus on anything that goes "bang" and crashes.  In 2011, I was in one a talk about software testing to a group of new developers as part of the Summer of Tech, where our introductory speaker spoke of an Ariane 5 rocket where the parts had been individually tested, but put together into a new rocket.  As you can imagine, this rocket barely cleared the launchpad before it exploded. Another cautionary tale of "you didn't test enough".

What's interested is what happened next.  One of the audience interrupted with "but we're creating applications ... not rockets.  Our stuff is hardly critical like that".

To me, this reinforced how vital it is to have stories and tales of failure which are relevant to the audience.  I thought the tale was a wonderful one, but looking around the room, I saw it failed, because the audience simply did not believe it applied to them.

New Zealand is a small and very pragmatic country.  A lot of processes here are still fairly manual compared to my home country of England, and overall that's not a bad thing.  Here in New Zealand if something isn't broken, New Zealand attitude is "why try and fix it", so,

  • In the UK for trains we have automated ticket vending machines (usually vandalised), automated turnstiles to get onto platforms etc.  We used to get told our ticket prices would have to go up above inflation to cover this automation and it's continual repair.  In New Zealand they just have an old fashioned conductor on the train who checks and clips tickets, and can sell you one for cash if you need one (without charging a fee).  Simple, but y'know it works.
  • In the UK there were plans to build a so called "chip and bin" system for collecting rubbish.  The UK rubbish collection vehicles would scan your bin, which would then be weighed as it was disposed of.  Back at the main office this data would be collected, and an itemised bill created (your bill would have to be more to cover all this new technology which would have to be developed and maintained).  Again being super pragmatic, New Zealand councils just sell you council marked rubbish bags which have a charge on them for collection.  Only rubbish in council bags is disposed of.  The more you throw away, the more bags you need.  Elegantly simple yes?

An unfortunate by-product of this pragmatic way of doing things is the attitude of "she'll be right", which means if something goes wrong, don't worry, we'll be able to fix it.  This means on the whole New Zealanders can be a little bit more risk takers than their European or American counterparts.

In a testing consultancy I previously worked for, my test manager would talk about how New Zealand would have to face an inevitable "compelling event" for software testing.  Most other countries have had one, but so far although there had been failed there hadn't been one in New Zealand.

A "compelling event" is something that forces you to take action - a wake up call the the importance of the value of testing.  A compelling event isn't just software going bad when it hits production, it's software causing a very large and public pain that it unnerves the local IT industry as a motivation to not repeat that mistake, often with a gasp of "that so easily could have been us".  It's not a rocket blowing up half way around the world, it's software that fails but feels too close for comfort compared to what you're doing right now.

It's a compelling event because once it's happened it compels you to take a good hard look at your strategies around testing and quality and ask "are we (and not just testers here) doing enough?".

Sadly, we've finally had our big compelling event here - the Novopay saga.  Novopay was an online system for managing the payment of teacher and school staff salaries.  But sadly it has gone into production to go horribly wrong, with payments to these people missing in the system.  There have been schools where teachers are missing months of pay (and teachers aren't exactly in the most affluence of careers).  More than just being "missed payments", there have also been erroneous payments gone out from schools - some teachers who've never worked for a school, are receiving payments from that school - and schools are having to bring in extra staff to go through the Novopay payments with a fine tooth comb to work out what's payments are good, what payments are errors and what payments are missing.

The media have been in a frenzy over it, hounding senior staff at the Ministry Of Education (who are behind Novopay) saying they have found a report of "200 known bugs" in the production system when it went live.  It is terrifying to me as a senior tester is how bad that sounds when the media put it like that.

Of course when I worked on an avionic computer, it'd surprise people to know it flew under the strictest of safety conditions, but still there were 1,500 known bugs in the software.  The important message though isn't the bug count, but the severity.  But this is a very difficult dialogue to have with a set of press hounding for a "big story" and "potential coverup".

In an unusual move, the Ministry of Education has released the test plans for the Novopay project into the public domain.  I took a look through, and they show a fairly thorough planned approach was taken to testing, not the slap-dash "rush into production" that many in the press are claiming.  That was pretty unnerving - most of our "testing horror stories" like the Ariane rocket involve people simply not testing, thinking it would be "alright mate".

The documentation they have also shows a decent approach to several forms of functional testing, indeed going beyond what I would have initially thought to do,

However for all it's success in testing, something has gone very wrong in production, there can be no doubt about that.  And it is causing real grief to people whose life it should be easing.

What went wrong?  Each of us will have our opinions about this, and no doubt all of us will be right to some extent, whilst at the same time knowing in our hearts how easily something like this could happen to us.

Testing is about identifying and helping to remove risks, but it does not guarantee a bug-free product.  We can test in the many different ways we expect our software to be used, we can check for all the problems we can imagine could happen.  But that will never cover everything, there will always be issues beyond our imagining.

But the imagination of any individual has it's limits.  We have to be careful not to find ourselves screaming "inconceivable" at every bug found in production.  In my opinion, this is why testing and quality is not just a "test team" ticket.  We need to be engaged with developers, business analysts, market managers, usability experts, end users to work out a whole scope of things to test (from areas of functionality to ways of using) that is beyond the imagination we as testers can summon when looking at requirements.  But by then our scope of testing has probably increased by several orders of magnitude, and we don't have unlimited time.

That's when we need to do some risk based analysis, and cover as much of it as possible, touching on all the "most likely" cases, then touching on samples in other areas.  If you find problems, then odds are there'll be others, so keep looking in that area.

For myself then , if there is a compelling lesson to be had from Novopay it is that we need to draw up our test plans as we always have done, and then ask of others "what could I have missed".

No comments:

Post a Comment