
NatWest Borked


    #41
    Originally posted by NickFitz View Post
    It was "hardware failure" apparently: RBS Says Computer Failure Is Unrelated to Last Year

    They really should tell the cleaner where it's safe to plug that vacuum cleaner in.
    Edit: this is not the explanation for the NatWest failure; it happened at a different company a few years ago.

    I was present at a meeting when the entire SAP system was bobbed. The failover of the db server to another server failed, because the failover monitor could see the db server at one level but was waiting for a response at another level, which never came. The operations manager told the data centre, in India, to switch off the db server. Turn it off. That way the failover monitor would notice the db server was gone and switch to the shadow db server.
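
    A minimal sketch of that kind of two-level check, in Python, with every hostname, port, and timeout invented for illustration. The flaw being described: failover only keys off the low-level probe, so a box that is pingable but wedged at the application level never trips it.

    import socket

    DB_HOST = "db-primary.example.com"  # hypothetical primary db server

    def host_reachable(host, port=22, timeout=2.0):
        # Low-level probe: can we open a TCP connection to the box at all?
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def db_responding(host, port=5432, timeout=5.0):
        # Application-level probe: does the database actually answer?
        # (A real monitor would run something like "SELECT 1" here;
        # waiting on a bare recv() is just a stand-in.)
        try:
            with socket.create_connection((host, port), timeout=timeout) as s:
                s.settimeout(timeout)
                s.recv(1)  # times out if the db is wedged
                return True
        except OSError:
            return False

    def should_fail_over():
        # The flaw: only the low-level probe triggers failover, so
        # "host up, database hung" leaves the monitor waiting forever.
        return not host_reachable(DB_HOST)

    if host_reachable(DB_HOST) and not db_responding(DB_HOST):
        print("db wedged but host alive: no failover until someone pulls the plug")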

    Apparently "Switch the database server off" was too hard a concept for our offshore colleagues to understand, since they instead shut everything down; all the application servers. Rather than users experiencing a hanging system that then started working again, they lost connection and their work in progress.

    On restart, the shadow server wouldn't come up. A network card had failed, and replacing it would take four hours, despite spares being on hand, since no-one in the data centre had any technical knowledge or ability whatsoever. Eventually, a techy guy in the UK managed to talk to the shadow db server over one of its other network cards, persuaded it to ignore the failed card, and so the system was restored.
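
    The multi-NIC rescue amounts to something like this (addresses made up): walk the server's interface addresses in turn and skip the dead card.

    import socket

    # Hypothetical addresses, one per network card on the shadow db server.
    SHADOW_DB_ADDRS = ["10.0.1.15", "10.0.2.15", "10.0.3.15"]
    DB_PORT = 5432

    def connect_via_any_nic(addrs, port, timeout=3.0):
        # Try each card in order; a dead card just times out or refuses,
        # and we move on to the next - what the techy guy did by hand.
        for addr in addrs:
            try:
                return socket.create_connection((addr, port), timeout=timeout)
            except OSError:
                continue
        raise ConnectionError("no network card on the shadow server is answering")

    # Usage: conn = connect_via_any_nic(SHADOW_DB_ADDRS, DB_PORT)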

    The announced cause of the outage: hardware failure.
    Down with racism. Long live miscegenation!

    Comment


      #42
      Originally posted by Platypus View Post
      Since he tailored his CV
      Knock first as I might be balancing my chakras.

      Comment


        #43
        Originally posted by ctdctd View Post
        Don't be silly, Nick; it's a mainframe system with multiple redundancy, according to El Reg.

        I think there were two cleaners with vacuum cleaners.
        This goes to show the incompetence of the NatWest IT department. I set up, support, and design these types of systems (Parallel Sysplex, Geographically Dispersed Parallel Sysplex, PPRC, etc.), and this has never happened on any of the systems I've worked on. I actually built the systems that IBM use to design and test this stuff, tested such errors and worse, and a 24/7 operation works. As people have said, they probably made the mainframers redundant and invested in 'newer' technologies. Most IT departments in the UK (and USA) are run by ******* idiots anyway...
        Brexit is having a wee in the middle of the room at a house party because nobody is talking to you, and then complaining about the smell.

        Comment


          #44
          Originally posted by NotAllThere View Post
          I was present at a meeting when the entire SAP system was bobbed. The failover of the db server to another server failed, because the failover monitor could see the db server at one level but was waiting for a response at another level, which never came. The operations manager told the data centre, in India, to switch off the db server. Turn it off. That way the failover monitor would notice the db server was gone and switch to the shadow db server.

          Apparently "Switch the database server off" was too hard a concept for our offshore colleagues to understand: instead they shut everything down, including all the application servers. So rather than experiencing a hanging system that then started working again, users lost their connections and their work in progress.

          On restart, the shadow server wouldn't come up. A network card had failed, and replacing it would take four hours, despite spares being on hand, since no-one in the data centre had any technical knowledge or ability whatsoever. Eventually, a techy guy in the UK managed to talk to the shadow db server over one of its other network cards, persuaded it to ignore the failed card, and so the system was restored.

          The announced cause of the outage: hardware failure.

          Thanks for the clarifications; hope you didn't post the above from your workstation at work, mate!

          Comment


            #45
            Originally posted by NotAllThere View Post
            I was present at a meeting when the entire SAP system was bobbed. The failover of the db server to another server failed, because the failover monitor could see the db server at one level but was waiting for a response at another level, which never came. The operations manager told the data centre, in India, to switch off the db server. Turn it off. That way the failover monitor would notice the db server was gone and switch to the shadow db server.

            Apparently "Switch the database server off" was too hard a concept for our offshore colleagues to understand: instead they shut everything down, including all the application servers. So rather than experiencing a hanging system that then started working again, users lost their connections and their work in progress.

            On restart, the shadow server wouldn't come up. A network card had failed, and replacing it would take four hours, despite spares being on hand, since no-one in the data centre had any technical knowledge or ability whatsoever. Eventually, a techy guy in the UK managed to talk to the shadow db server over one of its other network cards, persuaded it to ignore the failed card, and so the system was restored.

            The announced cause of the outage: hardware failure.
            Are you the NAT in NatWest?
            While you're waiting, read the free novel we sent you. It's a Spanish story about a guy named 'Manual.'

            Comment


              #46
              Originally posted by doodab View Post
              are you the NAT in Natwest?
              This was not NatWest, and it happened a few years ago. The point is that "hardware failure" doesn't mean there wasn't some human f.sk up somewhere.
              Down with racism. Long live miscegenation!

              Comment


                #47
                Originally posted by darmstadt View Post
                This goes to show the incompetence of the NatWest IT department. I set up, support, and design these types of systems (Parallel Sysplex, Geographically Dispersed Parallel Sysplex, PPRC, etc.), and this has never happened on any of the systems I've worked on. I actually built the systems that IBM use to design and test this stuff, tested such errors and worse, and a 24/7 operation works. As people have said, they probably made the mainframers redundant and invested in 'newer' technologies. Most IT departments in the UK (and USA) are run by ******* idiots anyway...
                WHS.
                Behold the warranty -- the bold print giveth and the fine print taketh away.

                Comment


                  #48
                  Originally posted by Sysman View Post
                  I read some of the comments about multiple redundancy with a chuckle.

                  Once you have seen a single UPS, just one of many, take out a whole building, you view things with a bit more scepticism.
                  Now that shouldn't happen; sack the architect.
                  Always forgive your enemies; nothing annoys them so much.

                  Comment


                    #49
                    Originally posted by NotAllThere View Post
                    The point is that "hardware failure" doesn't mean there wasn't some human f.sk up somewhere.
                    In my experience it usually means there was.
                    While you're waiting, read the free novel we sent you. It's a Spanish story about a guy named 'Manual.'

                    Comment


                      #50
                      Originally posted by doodab View Post
                      In my experience it usually means there was.
                      If we define hardware as "the stuff you can kick", then people are hardware.

                      Comment
