Tuesday, November 6, 2012

Testing the process of change ... lessons from HMS Invincible



Back in 2003 I worked for a company that provided the software used on the shipboard servers of the Royal Navy. The servers formed an information network for sharing important documents between ships around the world. Surprisingly, rather than all being top-secret intelligence, most of this was housekeeping. A naval ship shares many characteristics with an office: a large number of people aboard need to share information on activities, events, and usage of disposables (you'd not be too pleased if your office ran out of coffee or toilet rolls). It is in many ways a floating community, and the Naval intranet allowed the organisation of many aspects of daily life.

I was a developer at the time, and following a course I had been working (initially at home) on a Perl script, which I got my manager's permission to pitch to our customer. I'd written my application from scratch, but today we'd recognise many of its attributes as those of a shipboard Wiki (I called it the Generic Web Page, having never heard of Wikis at the time). It allowed information pages (which could be instantaneously modified by users with suitable permissions) to be shared between personnel onboard ship as well as between other ship servers. Whereas before, ships saved and exchanged information in document form (which took time to replicate on the Naval intranet), my ship-Wiki could be updated instantaneously.

The Navy were impressed. Although in their opinion it was not suitable for anything secure, it had a lot of potential uses because of its speed over the document-sharing method then in use. They wanted to try it out in an upcoming NATO exercise onboard the UK flagship for the exercise, HMS Invincible.

Of course management was delighted to get such an enthusiastic customer response. But they had a fear: could I install my Generic Web Page application safely onboard the Invincible's server? This was their dilemma. If I got this application working, it would be a huge statement about our company's can-do attitude. And if it worked well, we had all kinds of ideas to expand the product and provide more of these kinds of applications as a whole new line of work.

However, if something went wrong, we risked knocking out the information-sharing capability of the Royal Navy flagship in what was likely to be a billion-dollar joint-forces exercise. The Royal Navy would look bad in front of its NATO partners, and we would look unbelievably incompetent to our customer.

Risk and gain: all change has elements of both. What we decided was this: we took a replica of HMS Invincible's setup onto a spare server, rolled our change on, and ran the server for a day. Then we rolled the change off and checked we'd removed it.

We did this to develop a list of steps to apply the change, and to confirm we were drilled in its use and application, but mostly to check and explore for any potential issues. We wanted no room for error, so we did this a total of four times. Then on the fifth day, we used our steps on a completely different server to make sure there were no other potential surprises (unexpected configuration, perhaps).

We produced a report, and our Naval adviser was satisfied we'd taken adequate precautions, so we were allowed to install on HMS Invincible herself. It was quite an experience to go into the field and perform this change, and amazing to be in the belly of such a grand titan. Did you know the ship has a small supermarket inside, the NAAFI, that sells soft drinks, chocolate bars and so on? Yes, the ship has more people and shops than some villages I've lived in!

The installation was a success, and the software proved itself within the exercise. The Navy decided it wanted to develop that kind of capability, so the piece of work was extended into a much bigger programme, although I didn't continue to work on that project.

Testing the process of change

In testing we get quite used to our test environment: over the course of testing it undergoes so many changes and tweaks to get things right. Sometimes it's new builds (which are easy to track under configuration management); sometimes, though, it's a setting tweak a developer tried and forgot to write down.

At the end of testing, you produce an exit report which signs off that what is in your test environment has been checked, and seems acceptable for production.

But how do you confirm that the change outlined for production will produce an environment which echoes the one you've signed off? For most releases to established systems, it's usually not a new build that's applied, but a series of changes that modify elements of your applications and settings.
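One simple way to check whether production really echoes the signed-off environment is to snapshot the settings of each and diff them. The sketch below is illustrative only (the article doesn't prescribe a tool): the `key=value` snapshot format and the function names are my assumptions.

```python
# Hypothetical sketch: diff two environment snapshots to spot settings that
# drifted between the signed-off test environment and production.
# The 'key=value' file format and names here are assumptions, not from the post.

def load_snapshot(path):
    """Parse 'key=value' lines from a snapshot file into a dict,
    ignoring blank lines and '#' comments."""
    settings = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            key, _, value = line.partition("=")
            settings[key.strip()] = value.strip()
    return settings

def diff_snapshots(test_env, prod_env):
    """Return (missing, changed): settings absent from production,
    and settings whose value differs, as {key: (test_value, prod_value)}."""
    missing = {k: v for k, v in test_env.items() if k not in prod_env}
    changed = {k: (v, prod_env[k])
               for k, v in test_env.items()
               if k in prod_env and prod_env[k] != v}
    return missing, changed
```

Anything reported as missing or changed is a candidate for that "tweak a developer forgot to write down".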

How can you test that your release team has all the changes needed for production? Well, our rehearsal for HMS Invincible was a good start. Start with an environment under test which mirrors production, apply your changes, and test the end result as you would in a production verification test. Then roll back, and confirm your changes have gone. Encourage the change team to repeat this process under conditions as production-like as possible (especially regarding timescales). Are any services interrupted during the change? Do any defects encountered during testing come back (meaning you've missed a change somewhere)? Can you perform the core, high-value actions? Can you roll back effortlessly? Have you developed a definitive list of steps and actions that works every time?
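The apply/verify/rollback drill above can be sketched as a small harness. This is a minimal illustration, not anything we used on the Invincible: the in-memory "environment" dict and the paired apply/rollback steps are assumptions for the example.

```python
# Hypothetical sketch of the rehearsal drill: apply a change as an ordered
# list of (apply, rollback) steps, verify it, roll it back, and confirm the
# rollback left no trace -- repeated several times before touching production.

def rehearse(environment, steps, verify, runs=4):
    """Run the apply/verify/rollback drill 'runs' times.
    Returns True only if every run passes verification after applying
    and restores the exact baseline state after rolling back."""
    baseline = dict(environment)
    for _ in range(runs):
        for apply_fn, _ in steps:               # roll the change on, in order
            apply_fn(environment)
        if not verify(environment):             # production-style verification
            return False
        for _, rollback_fn in reversed(steps):  # roll it off, in reverse order
            rollback_fn(environment)
        if environment != baseline:             # confirm the change is fully gone
            return False
    return True

# Illustrative usage: one step that "installs" the ship-Wiki and removes it.
env = {"wiki": "absent"}
steps = [(lambda e: e.update(wiki="installed"),
          lambda e: e.update(wiki="absent"))]
assert rehearse(env, steps, verify=lambda e: e["wiki"] == "installed")
```

A drill like this is exactly what surfaces an incomplete rollback: if any step forgets to undo itself, the baseline comparison fails on the very first run.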

It's amazing how often we take for granted that what we sign off in our acceptance test environment will be delivered to production. Is your project taking steps to ensure nothing's been missed?


3 comments:

  1. Solid thought. I will definitely note for future reuse and implementation.

    1. Thanks SoM. Good to get that feedback.
