Why NAS failed last week.

Fraidycat replied

11 September 2023, 03:29
I always do TDD with 100% code coverage.

Simple to understand clean code.

I still see so many developers write zero tests and huge 500-line (or longer) methods. Amateurs. Some developers have never written a single test case in their whole careers.
jamesbrown replied

11 September 2023, 00:26
Oh well, could've been worse. Ariane 5. Cough.
cojak replied

10 September 2023, 22:58
On that basis NAT, I think that the 4 hour window is the bit that airlines are peed about given the amount of money they have to cough up.
NotAllThere replied

10 September 2023, 07:05
Originally posted by vetran

Oh dear, so they have duplicate waypoints, bet that was designed by a consultant - muppets.

There's an ongoing global project to make all waypoint designations unique. Unfortunately it involves many different jurisdictions, which means it takes time, so the current standard allows for duplicates and defines the ways of dealing with them.

Originally posted by GJABS

I wouldn't say the concept of a single flight plan failing for -some- reason requires particularly deep thought. On the face of it they ought to have designed the system to raise an alarm immediately if a flight plan is rejected in a non-resolvable way, and skip over to process the next flight plans. This would require a robust procedure to guarantee that the failed flight plan receives manual intervention. But dealing with manual intervention issues is nothing new to air traffic controllers (for example in their dealing with aircraft declaring emergencies).

The flight plan wasn't rejected, because it was valid.

There are flight plans with duplicate waypoints processed successfully every day. This one just happened to have something about it that the processing system couldn't handle. There was a failure somewhere in the logic of processing it which caused an unresolvable error, so the system did what it was designed to do: fail safe. As a result, although people were inconvenienced, nobody actually died.

It's very easy to point the finger after the event and say "oh, they should have spotted that", but that could be said of the vast majority of bugs that get through into production systems. Given the details supplied so far, we don't really know whether this was an easy one to spot. As it took over the four hour limit to resolve, either the IT workers are really crap or it was difficult to figure out what had gone wrong.

Apparently the error has been patched.

The fact is it is (provably) impossible to guarantee any computer system is free of errors. Even catastrophic ones.
cojak replied

9 September 2023, 22:51
But CUK errors don’t stop people from flying.
GJABS replied

9 September 2023, 15:05
Oh the irony. When I submitted my reply above, I got an error from CUK:
Invalid SQL: DELETE FROM cuk_cacheevent WHERE `event` IN ('nodeChg_4273261','nodeChg_4273064','nodeChg_19', 'nodeChg_16','nodeChg_2','nodeChg_1') /**cacheevent**/;
But we can be sure this error won't prevent subsequent posts from being made on here.
GJABS replied

9 September 2023, 15:00
Originally posted by Gibbon

Their 'fault' was not thinking deeply enough about the ramifications and allowing a single unresolvable FP (not erroneous) to disrupt a whole system.

I wouldn't say the concept of a single flight plan failing for -some- reason requires particularly deep thought. On the face of it they ought to have designed the system to raise an alarm immediately if a flight plan is rejected in a non-resolvable way, and skip over to process the next flight plans. This would require a robust procedure to guarantee that the failed flight plan receives manual intervention. But dealing with manual intervention issues is nothing new to air traffic controllers (for example in their dealing with aircraft declaring emergencies).
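
To make the idea concrete, here is a rough Python sketch of the "alarm, skip, and hand off to a human" loop. Everything in it is made up for illustration: the `Processor` class, the `UnresolvableFlightPlan` exception and the `ambiguous` flag are hypothetical stand-ins, not anything from the real NATS systems or the CAA report. It also assumes the failure really is confined to the one plan, which is the hard part in practice.

```python
from dataclasses import dataclass, field


class UnresolvableFlightPlan(Exception):
    """Hypothetical error: the plan is valid but cannot be processed."""


@dataclass
class Processor:
    alarms: list = field(default_factory=list)        # stand-in for an operator alert channel
    manual_queue: list = field(default_factory=list)  # plans waiting for manual intervention
    processed: list = field(default_factory=list)     # ids of plans handled automatically

    def handle(self, plan: dict) -> None:
        # Hypothetical stand-in for the real route-processing logic.
        if plan.get("ambiguous"):
            raise UnresolvableFlightPlan(plan["id"])
        self.processed.append(plan["id"])

    def run(self, plans: list) -> None:
        for plan in plans:
            try:
                self.handle(plan)
            except UnresolvableFlightPlan as exc:
                # Raise the alarm, park this plan for a human, carry on with the rest.
                self.alarms.append(f"manual intervention needed: {exc}")
                self.manual_queue.append(plan)


if __name__ == "__main__":
    p = Processor()
    p.run([{"id": "FP1"}, {"id": "FP2", "ambiguous": True}, {"id": "FP3"}])
    print(p.processed, p.alarms)
```

The point of the separate `manual_queue` is that the skipped plan is never silently dropped: it stays visible until someone deals with it.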
cojak replied

8 September 2023, 19:21
I am actually surprised about this, as I worked for an aerospace company and have spent a good few days in Hazard Identification workshops going through scenarios with a fine tooth comb (I was the ITIL rep).

Really, they must be crap at this.
Gibbon replied

8 September 2023, 17:54
Originally posted by vetran

Oh dear, so they have duplicate waypoints, bet that was designed by a consultant - muppets.

NATS don't allocate WPs; the duplicates are overlaps from different standards that haven't been ironed out yet. Their 'fault' was not thinking deeply enough about the ramifications and allowing a single unresolvable FP (not erroneous) to disrupt a whole system. A failure of resilience, and real safety-critical systems are full of resilience: a system stop is a FAILURE. Think jet engines.
vetran replied

8 September 2023, 15:45
Oh dear, so they have duplicate waypoints, bet that was designed by a consultant - muppets.
Lance replied

8 September 2023, 12:54
The system functioned as designed.

However...
The requirements failed to identify a specific scenario that the system should have been able to resolve. Therefore the system failed safe (as per the design).

Nothing more here than an expensive lesson learned with regards to scenarios that probably should have been considered.
cojak replied

6 September 2023, 14:09
Considering UK airspace is the most expensive in the world, I’m not surprised that airlines are p1ssed at the meagre 4 hours of backup flight data.

I recognise the arrogance of senior management not understanding the difference between proactive problem management and reactive incident management*. Their Hazard Identification and Risk Management team must be as weak as dishwater.

*I’m sure they do now though…
NotAllThere replied

6 September 2023, 13:53
To be specific FPRSA-R. NAS is the bit that comes after.
ladymuck replied

6 September 2023, 13:34
NATS, you mean
NotAllThere started a topic Why NAS failed last week.

6 September 2023, 12:11
Why NAS failed last week.

https://publicapps.caa.co.uk/docs/33...y%20Report.pdf

Interesting reading.