Global IT Outage reported. (AKA 'Old lags reminiscing'!)

This is the crux of it, I agree. Of course companies generally don't understand or do risk management at all well, in particular not realising that the cost associated with a risk actualising can be put onto the balance sheet if they do it right, in the same way as opportunity - "goodwill" - is recorded as an asset.
A very long time ago, I worked with a chap who'd been one of Arnie Weinstock's "bright young management accountants".

He was fond of quoting some of Weinstock's little gems, one of which was, allegedly, "if you put a lie on the balance sheet, you'd better remember what the truth was." Apparently this came down to making sure that the balance sheet only showed what you could sell for cash and how much it was worth if you did so. Given Weinstock's track record, that seems to me like advice to follow.
 
An interesting opinion. For anybody who doesn't have 14 mins

The gentleman in the video is suggesting.....
  1. CS gamed the system to run untrusted code in trusted mode (there are some very good reasons for doing this but it's risky because global meltdown)
  2. They also flagged their code as "must be run - definitely don't try starting your machine without this" (again, good reasons for that)
  3. They released incorrect code which had an unhandled exception (anybody who tells you this doesn't happen all the time is almost certainly a QA manager)
So, 3 crashed the "program" and because it's running in "really important" mode that crashes the computer. 2 stops Windows auto recovering from this. And 1 allowed them to release faulty code to the world.
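To make point 3 concrete, here's a hedged sketch (hypothetical file format and names, in Python rather than whatever the driver is actually written in): a parser that assumes its content file is well-formed will throw on an empty or all-zero file, and an unhandled exception in kernel mode isn't a stack trace, it's a dead machine.

```python
import struct

def parse_channel_file(data: bytes) -> int:
    # Hypothetical format: a 4-byte magic header followed by a 32-bit count.
    # Nothing here guards against a short, empty or all-zero file.
    magic, count = struct.unpack_from("<4sI", data)  # raises on short input
    if magic != b"CFG1":
        raise ValueError("bad magic header")
    return count

# A well-formed file parses fine...
assert parse_channel_file(b"CFG1" + struct.pack("<I", 3)) == 3

# ...but an unexpected input is an unhandled exception, and inside a
# kernel-mode driver that takes the whole machine down with it.
try:
    parse_channel_file(b"\x00" * 8)
except (ValueError, struct.error) as exc:
    print("in kernel mode this would be a blue screen:", exc)
```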

Meanwhile I saw another article where MS were complaining that "an EU law" prevented them from stopping Crowdstrike doing this (probably from doing 1, but it was a confused article).

And here's what happened next.....

This is kind of up there with the floating-point overflow that blew up Ariane 5.

 
An interesting article - as ever The Register is on the money.
This is why greybeards like me (maybe not all of us though?) have reservations about Platform as a Service and Software as a Service, for the vulnerability it brings with no control over your own destiny. I guess it's hard to stop thinking of the days when everyone had their own computers and IT departments staffed with programmers, analysts and tech support; there were still crashes, but you had the staff to deal with them, and no or much less dependence on external resources and providers. Of course, we also didn't have the Internet, because the limited access to JANET and corporate WANs made us safer from external hacking (many of us used bespoke Ethernet rather than Netware), so we didn't need things like Crowdstrike etc, booting in kernel mode or from BIOS. Maybe my nostalgia is getting the better of me though.
Your view is not perhaps as dated as you think:

 
Your view is not perhaps as dated as you think:

Y-e-e-e-s......the point being of course that 37Signals is actually (now) a very big company. It's always going to be cheaper to buy something rather than rent it as long as you know exactly what you need. On Prem is great. Owned kit in a datacentre is even better (or slightly worse, depending) until you need to change something. The article mentions that 37S needs some pretty big computers for a small amount of time - but doesn't explain in any way how they have satisfied that requirement. That's the exact time that renting is cheaper. You can drive around in your Smart car all year until you need to get a new bed from Ikea and then rent a transit van. And rent an Atom for the days you want to show off on the track.
 
Anyway, back on topic....CrowdStrike have explained what happened.

It wasn't that they didn't test their software (looked before they pulled out into a busy road)
Or that the test failed to find the problem (they spotted all the oncoming traffic)
It's that their deployment pipeline ignored the error (drove out onto a motorway regardless)

Strangely, I find that very plausible but not at all reassuring......

But not to worry, they did at least buy people a coffee and pack of Doritos - or at least tried to - https://www.bbc.co.uk/news/articles/ce58p0048r0o
 
So it was over-automated? The file exists so it must be ok, not checking that it contained garbage or was empty? That still sounds like a missing test to me.
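As a minimal sketch of the difference (file header and function names entirely made up for illustration): an existence check happily passes a file full of garbage that even a trivial content check would reject.

```python
import os
import tempfile

def naive_check(path: str) -> bool:
    # "The file exists, so it must be ok."
    return os.path.exists(path)

def content_check(path: str) -> bool:
    # Also insist the file is non-empty and starts with the expected
    # header (header value is a made-up example).
    if not os.path.exists(path):
        return False
    with open(path, "rb") as fh:
        data = fh.read()
    return len(data) > 0 and data.startswith(b"CFG1")

# Write a file full of zero bytes - present on disk, but pure garbage.
with tempfile.NamedTemporaryFile(delete=False) as fh:
    fh.write(b"\x00" * 42)
    path = fh.name

assert naive_check(path)        # passes: the file is there
assert not content_check(path)  # fails: the content is junk
os.unlink(path)
```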
 
So it was over-automated? The file exists so it must be ok, not checking that it contained garbage or was empty? That still sounds like a missing test to me.
It sounds like a CI/CD pipeline. It's common to have tests in these to make sure your code isn't doing something silly. It can either warn you "hey, this is 3% slower than last time" or stop the pipeline "nope - that's gonna nuke all computers, let's not".

They claim that the test ran, found the error but the pipeline continued running. That's just a different kind of faulty code from the kind we thought they had.
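A hedged sketch of that failure mode (all names made up, nothing to do with CrowdStrike's actual pipeline): the gate runs every check and records the failures, but a single flag decides whether a failure actually halts anything.

```python
def run_pipeline(steps, stop_on_failure):
    """Each step is a (name, passed) pair. Returns the steps that ran.
    A sound gate halts at the first failed check; the faulty behaviour
    described above is to log the failure and carry on regardless."""
    executed = []
    for name, passed in steps:
        if not passed:
            print(f"check failed: {name}")
            if stop_on_failure:
                break  # halt: nothing after this point runs
        executed.append(name)
    return executed

steps = [("unit tests", True), ("content validator", False), ("deploy", True)]

# As it should be: the failed validator blocks the deploy step.
assert "deploy" not in run_pipeline(steps, stop_on_failure=True)

# As described: the failure is logged, then the pipeline carries on.
assert "deploy" in run_pipeline(steps, stop_on_failure=False)
```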
 
So it was over-automated? The file exists so it must be ok, not checking that it contained garbage or was empty? That still sounds like a missing test to me.
I agree.

It also sounds like they're a little light on both discipline and common sense. :(
 
I have seen this before with CI/CD pipelines, exactly as you describe @JonathanRyan, where the results of assertions are recorded but not interrogated, and with no exception cases catered for. The tools facilitate better processes, but if they are not set up right or QA'd properly, the whole house crashes down.
 
My understanding was they didn't test the update and instead relied on a validator tool to spot errors; that tool failed to spot the errors, so the deployment went ahead and caused the kernel driver to crash when it tried to load the update file. What would particularly concern me is that this isn't a one-off and they've done this a number of times before, but because they've crashed Linux rather than Windows systems, it's not had such widespread coverage. Also, while recovery was fairly straightforward on a normal PC or server, it was a lot more difficult on cloud or encrypted systems, perhaps highlighting the issues with using some of these cloud computing resources.

I can appreciate there's a difficult balance, because if there's an exploit out in the wild then the faster they get their update out to cover it, the better. However, given the size and scale of Crowdstrike, they clearly should have a better system for checking and testing their updates. I saw some comments that maybe the company needs to focus more on its own product rather than sports sponsorship...
 
My understanding was they didn't test the update and instead relied on a validator tool to spot errors,
The problem with any validation tool is that if different coders use different styles, some code will pass tests designed to catch a different way of coding; the old "return 0 / return 1" problem. Mind you, even the mark 1 eyeball can miss that if given too much code to check.
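A minimal sketch of that "return 0 / return 1" problem (conventions and names are illustrative, not any particular tool's): a validator written for one exit-code convention reads the other convention exactly backwards, so a genuine failure sails through as a pass.

```python
def validator(result: int) -> bool:
    # Written for the Unix convention: 0 means the check passed.
    return result == 0

# Coder A follows the Unix convention: 0 = pass, non-zero = fail.
a_pass, a_fail = 0, 1
# Coder B returns a boolean instead: 1 (true) = pass, 0 (false) = fail.
b_pass, b_fail = 1, 0

assert validator(a_pass) and not validator(a_fail)  # A is read correctly
assert validator(b_fail)      # B's *failure* sails through as a pass
assert not validator(b_pass)  # ...and B's genuine pass gets rejected
```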
 
The problem with any validation tool is that if different coders use different styles, some code will pass tests designed to catch a different way of coding; the old "return 0 / return 1" problem. Mind you, even the mark 1 eyeball can miss that if given too much code to check.
I mean, not so much any more. Proper standards, automated regression and a drop of magic AI should solve most of those.

Of course, ironically yesterday I wrote my post during another massive Azure outage....
 
Indeed, the Azure issues yesterday affected our company systems to a degree, but it wasn't a complete outage. We just had slow performance, exactly what you'd expect from a DDoS attack. Another reason why Cloud platform services are likely to go out of favour.
Re automated regression testing - speaking as a test specialist for the last 35 years - I struggle with it, simply because it relies either on people deciding how many assertions to write for a given service call or class, or, where these are automatically derived, they are usually designed at the unit level at best rather than at a macro level, where the behaviour of users rather than data drives the system. AI may eventually address that, but I doubt it, simply because the AI needs to run for a long time capturing user behaviour before it can replicate and then extrapolate from the behaviour it has learned. I still believe there is a place for really good manual test analysts and manual testing.
 
Lawsuit initiated - but maybe not from who you might have thought!

LAWSUIT
 
I was reading an article that Delta are estimating their losses at around 500 million dollars from lost revenue and having to pay for accommodation and food for stranded passengers. I was also watching a video where the person was explaining Crowdstrike's liabilities, showing that although in the legal agreement Crowdstrike don't accept any liability for anything, that wouldn't protect them in a case of gross negligence, which he believed this would count as. That could well lead to bankruptcy for Crowdstrike, so you can understand the shareholders suing them first.
 
I imagine before long, once all the lawsuits are filed, they will file for Chapter 11, thus avoiding paying anyone in full, wiping out the shareholders and selling the IP assets to a new company.
 
Angry at the way some businesses seem to be run these days rather than the messenger!
 
Lawsuit initiated - but maybe not from who you might have thought!

LAWSUIT
Interesting. I'm not great at economics but it doesn't really seem like a smart move for their investors.

"Oh, they did something really dumb so we lost 30% of value. Let's sue the company we own into oblivion!!!".

Re automated regression testing - speaking as a test specialist for the last 35 years - I struggle with it, simply because it relies either on people deciding how many assertions to write for a given service call or class, or, where these are automatically derived, they are usually designed at the unit level at best rather than at a macro level, where the behaviour of users rather than data drives the system.
I think that's true if you are talking about automated test generation but not if you really mean execution. Why not run every test you've ever run before you deploy to prod?

As for all the cloud haters......several years ago, we needed to upgrade the disk space on an on prem SQL box. Ballpark quote was 20-30k but after a couple of months of negotiations we kind of lost interest and never did it. More recently, I needed to do the same on a VM. Click "yes" to the extra $50 charge, RDP and repartition the drives (whilst the DBs are up) and it's all done in 20 mins. Even faster on an Azure DB which you can scale almost literally on the fly.
 
Fair points Jonathan. I was talking about test generation rather than execution, yes.
My hesitation re cloud at scale is that my experience is that we've been sold the idea of easily spinning up additional test environments, but the cost means it is rarely done and thus we have had to compromise the testing we do (especially performance testing) or have had to do the old-fashioned thing of paying humans to work antisocial hours to use the machines when others aren't, which is counter-intuitive: I thought liveware was more expensive than hardware these days!
 
Interesting. I'm not great at economics but it doesn't really seem like a smart move for their investors.

"Oh, they did something really dumb so we lost 30% of value. Let's sue the company we own into oblivion!!!".
I think it is potentially a smart move, given there's a real possibility the company is going to oblivion shortly, so the shareholders want to get in there first before the money is all gone. There are a lot of big companies who have lost a lot of money, and I wouldn't be surprised if some of them want to sue Crowdstrike themselves. However, even if that doesn't happen, I think Crowdstrike are going to struggle, since their name is now associated with one of, if not the, biggest IT outages in history. They've been heavily advertising their name to expand, but I think they're going to struggle with that now and are likely looking at losing customers.
 
The reasons the outage happened are way above my understanding of computers, but the list of countries affected did not include the UK. - Oh yes it did: railways, airports, shops and banks, and for a lot of people the NHS.

I had an appointment that morning, and fortunately, being cautious, my GP's surgery always saves and prints out a copy of the following day's appointments, so I was OK. However they were still struggling with issuing online prescriptions, and the problems were not finally cleared up until the next week.

It just shows how fragile the internet is, and this was a technical blunder, not a malicious attack by our 'friends' elsewhere in the world. Is this a wake-up call? - Nah, I doubt it, they will do precious little about it, and blunder along from crisis to crisis!
 
It just shows how fragile the internet is,
The Internet is a tough old boot.

The problem is that some of the things that people run across the Internet are fragile at best.
 
It just shows how fragile the internet is, and this was a technical blunder, not a malicious attack by our 'friends' elsewhere in the world. Is this a wake-up call? - Nah, I doubt it, they will do precious little about it, and blunder along from crisis to crisis!
Yes, it highlights a growing problem with the internet, where singular points of failure (particularly with CDNs more recently) can cause widespread problems.
 