
Digesting The DBS and IBM Detail Findings Of 5 July Outage

I work in a bank.  Prior to that, I spent more than a decade working as an IT and management consultant serving primarily the financial services industry.  I am also a customer of DBS.  Naturally, I had been eagerly awaiting the detailed findings on what went wrong on July 5, when DBS suffered an unprecedented systems failure that prevented its customers from using the ATMs or accessing their accounts (to quote The Straits Times).  A month has passed and the report has been made public.  Before I go on happily talking about what I think, here is a disclaimer (required by the one who pays my bills).  This entry (and everything you read on this website, really) represents opinions in my own personal capacity.  And I am not a spokesperson for any entity.

I have gone through the detailed findings a couple of times (click here to view).  I would have thought digesting the findings would be a piece of cake, given my background.  But the findings, to me, are far from detailed.  They are confusing to read.  As far as the story goes, one fine day, IBM detected an instability in the communication link between DBS’s mainframe computer and its storage system.  Because of that, an engineer was sent to the DBS site to replace a cable.  This happened on July 3, at 11.06 am.  Here is a rundown of the events, annotated with my thoughts.

  • July 3, 7.50 pm –  The IBM engineer, as per the instruction given by the IBM support center, replaced the cable.  The report claims that the instruction was incorrect but appeared to have solved the problem.  I deduce from the report that the correct way is to use the machine’s maintenance interface (which I can understand).  So what did the engineer do?  Did he just yank out the cable?  As an IBM certified engineer, would he have known the steps involved in replacing a cable?  I have no idea.  The report does not say much.  I can only speculate.
  • July 4, 2.55 pm – All of a sudden, the problem reappeared.  This time, both the cable and the associated electronic cards had gone unstable.  The IBM engineer then escalated the issue to the [IBM] data center.  I suppose that was the logical thing to do, because replacing the cable did not seem to fix the problem.  Something else must have gone wrong.  The report does not go so far as to say that, had the cable been replaced correctly in the first place, the issue would have been fixed for good.
  • July 4, 5.16 pm –  After more than two hours of, I suppose, deliberation inside the support center located outside Singapore, the instruction was to try reseating the cable (my wild guess is that reseating means unplugging and reconnecting the cable).  So the IBM engineer did just that, and the problem seemed to have gone away.  So why did the problem appear to go away on both counts?  Nobody knows.  The detailed findings report does not say.  The support center might have deduced that the problem was due to a loosely connected cable.
  • July 4, 6.14 pm – Again, all of a sudden, the problem reappeared (by the way, I have worked in the technology line before and I know very well never to celebrate too early).  This time, the support center spent more time analysing the problem and appeared to insist that it was still a problem with the cable.  So the IBM engineer reseated the cable.
  • July 4, 11.38 pm – This time, the problem did not go away.  So, as per the support center’s instruction, the IBM engineer reseated the cable again.  It did not work.
  • July 5, 2.50 am – DBS was contacted to authorize a cable replacement during a quiet hour.  Previously, the cable was changed at 7.50 pm.  So I can only imagine that there may have been batch programs running during the midnight window.
  • July 5, 2.58 am – The IBM engineer replaced the cable the same way as before.  And unlike the last time, the storage system detected a threat to data integrity and stopped working in order to protect its data.  The million-dollar question is: why did the storage system not cease to work when the cable was replaced using the so-called incorrect steps on July 3?  Something else must have killed the system, but the report does not say.  What exactly did the engineer do that was different from before?  Why did the problem seem to go away after the cable was replaced the first time and reseated the first time?  Why did the problem reappear?  I am not convinced that these “incorrect procedures” (to quote the report) caused the outage.
  • July 5, 12.30 pm – The banking services were fully restored.

Half a day to bring the system back up?  Procedures aside, what happened to the disaster recovery system?  You mean, there is none?

We can only read the clues by decoding the extra steps MAS has asked DBS to take.

When I read action items, I often examine the rationale behind each item.  More often than not, gaps are identified, and action items are derived in order to close them.  I doubt the public will ever know what has gone wrong.  Looking at each item, it seems to me that MAS is concerned about single points of failure (exactly my thought), not happy with how DBS managed and handled the situation, and above all, not happy with IBM.

It does not get any more obvious than this:

“diversify and reduce its material outsourcing risks so that it does not overly rely on a single service provider or a single vendor’s products and services”

2 replies on “Digesting The DBS and IBM Detail Findings Of 5 July Outage”

IBM has many subcontractors working for it in Singapore, sourced from placement agencies. Certification? Certified engineer? Do you think job agencies send their folks for regular training and refresher courses? They are lean on costs. This episode shows the ‘rot’ within IBM Singapore and the culture of outsourcing risks and management to third parties, thinking they are the answer for all your IT needs and solutions. It is very hard to get experienced people (e.g. many are burned out after doing overtime over prolonged periods), and they are not paid what they are worth (bonuses and benefits are foreign terms to contract workers). The escalation shows a glaring lack of basic competencies at many levels within IBM. Why is that so? IBM knows why, but it will never tell. MAS is correct to order DBS to split its reliance on one service provider. However, IBM is king in mainframes.

I can totally relate to your comment. IBM will never tell, looking at the quality of information disclosed in the detailed findings report. And I doubt DBS or MAS has a clue about what went wrong. It would be hard to tear a portion of the outsourcing process out of IBM in a short time. I suppose DBS will take MAS’s advice on doing in-house system monitoring. But it is equally hard to build that skill within DBS in a short time. Perhaps insourcing would be an easier way out. And it would free up the millions set aside under the censure for other, better business opportunities, come to think of it.
