(Didn't)KnowBe4, (In)Secure Boot, 91Won't, and, of course, CrowdStrike

MegaZone is the editor once again this week, and what a week it was.  The news cycle was absolutely dominated by a single event, an event which has been called the largest IT outage in history.  I refer, of course, to the CrowdStrike Windows outage, which I've seen some starting to call 'CrowdStruck'.  (I'm not sure I care for that myself.)  My colleague AaronJB touched on this last week, but the news cycle had only just started at the time.  There was an avalanche of news this week - and it still hasn't stopped coming as I write this.

While CrowdStrike dominated the news, it wasn't the only thing that happened last week, so I'll touch on a few other items that caught my attention.

In other news, I was on the panel for the latest F5 AppSec Monthly podcast, which we recorded last week.  I understand it is in editing and may be posted by the time this article goes live.  It was my first time on AppSec Monthly and I hope to do it again.

(Didn't)KnowBe4

Cybersecurity firm KnowBe4 thought they were hiring a fine upstanding young man from the US as a Principal Software Engineer on their internal IT AI team, only to have him whip off his metaphorical hat and say imagine that, huh, me, a North Korean agent, working for you.  (Sorry, not sorry.)  To their credit, KnowBe4 has been very candid about the events and has recommendations for avoiding such things in the future.

I was tempted to refer to him as a 'sleeper agent', but he was about as stealthy as a one-man band.  And that has sparked a few internal discussions about this item around the old (virtual) watercooler.  What struck me and some of my colleagues was the apparent mismatch between the level of effort and sophistication that went into securing the position, and the apparent incompetence displayed in exploiting it.

Obtaining the job involved using a stolen identity that passed background checks and had verifiable references, as well as having someone able to make it through the interview process to be hired.  And that process involved four video interviews with a person matching the photo provided with the initial application.  The latter was achieved by using AI to modify a photo to resemble the plant.  It seems like some serious time and effort was put into getting their agent in the door - not to mention managing to get their candidate into the interview process in the first place.  Anyone who has ever been involved in a hiring process knows you don't just get one resume.

Getting their agent in the door was like winning the jackpot.  All they had to do was play it cool, invest some time to gain trust and access, and then they could quietly extract data, plant backdoors in the code (remember, the agent was hired as a software developer), maybe exploit the trust KnowBe4's customers grant them, etc.  Or just earn a decent IT salary to funnel to the regime.  James Bond stuff.

Instead, they got Ace Ventura.  As soon as the agent received their shiny new MacBook from KnowBe4, they started stuffing it full of malware.  And not sophisticated spy tools, either, but apparently standard, off-the-shelf malware that detection software immediately flagged.  And then, when KnowBe4's InfoSec team reached out to help, the agent seems to have panicked and bolted.  Seen here in a re-enactment.  One interesting tidbit is that the SOC determined the agent was using a Raspberry Pi to download the malware to the MacBook.

All of that effort, and luck, to get into the race - and they blew it right out the gate.  It's like more senior people handled the hiring process, but then handed the role off to the team newb to execute, and they tripped over their own feet.  I do kind of wonder what repercussions that may have had for the hapless agent.  I suspect North Korean leadership aren't really the 'a mistake is just a learning opportunity' types.

While this particular incident worked out for the best in the end, it is obviously not the only time North Korea, or other nation states, have taken this approach.  Which means any one of us could be unknowingly working with agents of a foreign power.  The problem of identifying all such agents is one that could keep IT security leaders up at night.

(In)Secure Boot

Next up on the agenda is PKfail, announced by Binarly REsearch, which undermines UEFI Secure Boot on hundreds of x86 and ARM platforms.  PK in this case is neither Player Killer nor Peace Keeper, but Platform Key - specifically one belonging to American Megatrends International (AMI).  A Platform Key sits at the root of the Secure Boot trust chain, and its private half is meant to be kept secret.  The short version is that a Platform Key meant only for testing, likely stemming from a reference implementation, and never intended to be included in shipping products - was.  Because the key is widely available, it can be used to confer trust on malicious code: systems that shipped with the key will trust anything it is used to sign.

Furthermore, the private component of one Platform Key was discovered to have leaked when source code containing it was uploaded to a GitHub repository.  These test keys have passed through the hands of countless developers and, as they were never intended to be trusted in production systems, they haven't been treated as sensitive information.  Binarly states that the first firmware subject to PKfail was released in May 2012, and the latest they detected was released in June 2024.  The full list of known affected vendors and products has been published in an advisory.
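If you're wondering whether a given machine shipped with one of these untrusted test keys, the check Binarly describes boils down to looking at the Platform Key itself for telltale test-key marker strings (such as "DO NOT TRUST" or "DO NOT SHIP") in the certificate.  Here is a minimal sketch of that check, assuming a Linux host with efivarfs mounted at the usual path - illustrative only, not a substitute for Binarly's own detection tooling.

```python
# Rough sketch: check whether this system's UEFI Platform Key (PK) looks like a
# vendor test key, using the marker-string heuristic described in PKfail
# coverage.  Assumes Linux with efivarfs mounted at /sys/firmware/efi/efivars.
from pathlib import Path

# PK variable under the EFI global variable GUID
PK_VAR = Path("/sys/firmware/efi/efivars/PK-8be4df61-93ca-11d2-aa0d-00e098032b8c")
SUSPECT_MARKERS = (b"DO NOT TRUST", b"DO NOT SHIP")

def pk_looks_untrustworthy() -> bool:
    if not PK_VAR.exists():
        print("No PK variable found (non-UEFI boot or Secure Boot not provisioned).")
        return False
    # First 4 bytes are the variable attributes; the rest is the signature list
    # containing the PK certificate.  A simple substring search is enough here.
    data = PK_VAR.read_bytes()
    return any(marker in data for marker in SUSPECT_MARKERS)

if __name__ == "__main__":
    if pk_looks_untrustworthy():
        print("Platform Key contains a test-key marker string - likely affected by PKfail.")
    else:
        print("No obvious test-key markers found in the Platform Key.")
```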

91Won't

Have you ever had a 911 call fail?  You will.  And the company that'll bring it to you?  AT&T.  (Yeah, those ads are from 30 years ago.  Kind of interesting to see how close they got, or didn't get, to how things are now.  But I digress...)

While CrowdStrike is the big outage on everyone's mind currently, back in February there was another outage which had serious, if much smaller in scope, effects.  AT&T suffered a 12-hour system outage in the US that took out voice and data services for 125 million AT&T Mobility customers.  This blocked over 92 million phone calls.  More critically, over 25,000 911 emergency calls were affected.  This really got the attention of the FCC, who just released their report on the incident.

Like most such failures, there was no single cause, but a chain of bad processes, complacency, shortcuts, and human error which snowballed into a disaster.  The trigger was one tech deploying a single, misconfigured device onto the network.  This caused a cascading failure as it propagated through the network, isolating one cell tower after another.  But getting to that point required people taking shortcuts and not following the proper processes to verify the device configuration, including a peer review - and then failing to test the device after installation, before allowing the change to spread.

The report identified insufficiencies in lab testing before deployment, in change control, and in the controls meant to arrest the propagation once it started, as well as issues with the systems and processes needed to recover the affected devices, which prolonged the outage.  In the Swiss cheese model, a lot of holes lined up to make things as bad as they were.

This wasn't the first time these types of failures caused a widespread outage - and it certainly wasn't the last, right, CrowdStrike?  It can feel like the industry isn't learning from the ample examples provided.  I'm not sure what will change that, if anything, but it'll probably require some kind of serious financial consequences to move the needle.

You've been... Kernel Struck!

Yeah, I know, but 'You've been...  Crowd Struck!' doesn't fit the meter.

All jokes aside, I struggled with how to tackle this issue.  After all, it was front and center in both the tech and mainstream press for most of the week.  What can I say that hasn't already been said, and probably seen or heard by everyone reading this already?  Often, when compiling news for TWIS, the goal is to highlight stories which the editor found interesting, but which may have flown under the radar for you, our readers.  And perhaps to offer some thoughts on the issue which weren't part of the existing coverage.  With CrowdStrike there was a firehose of coverage, and you would've had to be hiding under a rock all week to have missed it.  And so much has been said already that I'm not sure I really have anything unique to say.

On the other hand, this is being called the largest IT outage in history, and it was the story of the week, so it doesn't feel like we can very well not include it.

So, after going back and forth on how to approach this story - here it is...

I happened to be up, working late, when it started.  Sometime around 01:30 Eastern on July 19th I started having trouble accessing F5 internal servers, getting a DNS error instead.  Since they were internal I thought maybe my VPN was having an issue, so I tried connecting to a different endpoint, but still had trouble.  Some hostnames were resolving while others were not.  While I was troubleshooting the issue, my laptop bluescreened.  That was a surprise; I can't remember the last time I had a Windows bluescreen.

Fortunately, I wasn't one of the lucky winners of a boot loop and my laptop came right back up.  I got reconnected, and by 01:44 I was back on our F5 SIRT Teams channel to say I'd just had the bluescreen - and there was already a new thread from a colleague in our Singapore office at 01:40 saying they'd just had PCs in the office start to bluescreen.  We figured out pretty quickly that it was hitting everyone on Windows, as the few folks with Macs weren't having problems - other than those due to Windows servers being down.  The first comment pointing to CrowdStrike was at 02:02 Eastern, so it didn't take long to figure out.

We got to watch it unfold in real time, which was something to see.  Very rapidly, within the hour, there were reports of ground stops across the world as airlines, airports, and related systems were impacted.  Reports of hospitals and banks being down.  Retail chains closing because their point-of-sale systems were down.  It was like a Hollywood disaster movie playing out live.  Even then it was clear this was going to be a massive event.  F5's IT department acted very quickly and restored the majority of services, but they were working through the weekend to complete the cleanup.  This made for a hellish weekend, and beyond, for IT departments worldwide.
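Part of what made the cleanup so painful was that machines stuck in the boot loop generally needed hands-on attention: the widely reported workaround was to boot each one into Safe Mode or the Windows Recovery Environment and delete the faulty channel file before rebooting.  As a rough, hypothetical sketch of that single cleanup step - not official remediation tooling - it amounted to something like this:

```python
# Hypothetical sketch of the widely reported manual workaround for hosts stuck
# in the boot loop: after booting into Safe Mode / the Windows Recovery
# Environment, remove the faulty Rapid Response Content (channel) file and
# reboot.  File location and pattern per CrowdStrike's published guidance;
# illustrative only, not official remediation tooling.
from pathlib import Path

CROWDSTRIKE_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")
BAD_CHANNEL_GLOB = "C-00000291*.sys"  # the channel file implicated in the outage

def remove_bad_channel_files() -> int:
    removed = 0
    for channel_file in CROWDSTRIKE_DIR.glob(BAD_CHANNEL_GLOB):
        print(f"Removing {channel_file}")
        channel_file.unlink()
        removed += 1
    return removed

if __name__ == "__main__":
    count = remove_bad_channel_files()
    print(f"Removed {count} matching channel file(s); reboot normally afterwards.")
```

Trivial for one machine; soul-crushing when multiplied across thousands of endpoints, many of them remote or requiring BitLocker recovery keys just to get that far.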

CrowdStrike's CEO apologized the day of the incident (the 19th), and they had the technical details posted on the 20th.  By the 24th they had a preliminary post-incident review available.

All of this damage was done by an update that was available for only 78 minutes and impacted less than 1% of Windows systems worldwide (around 8.5 million of them), yet caused an estimated USD 5.4 billion in losses for the Fortune 500 alone (excluding Microsoft):

Customers running Falcon sensor for Windows version 7.11 and above, that were online between Friday, July 19, 2024 04:09 UTC and Friday, July 19, 2024 05:27 UTC, may be impacted.

The problem was not with a software update, but with a bad content update consumed by the Falcon sensor:

CrowdStrike delivers security content configuration updates to our sensors in two ways: Sensor Content that is shipped with our sensor directly, and Rapid Response Content that is designed to respond to the changing threat landscape at operational speed.

The issue on Friday involved a Rapid Response Content update with an undetected error.

The key phrase there, for me, is undetected error.  Given that virtually any Windows system that loaded this update was likely to bluescreen, it raises the question of how this could possibly have made it through quality assurance or release testing.  CrowdStrike claims the problematic file made it through their automated testing due to a bug in their Content Validator.  And, since this was a minor update to a previously released file, the more extensive testing performed on the first release, plus the trust in the results of the Content Validator, gave the update the green light for release without the problem being discovered.

As this was unfolding that first day, social media was full of armchair sysadmins blaming anyone who was 'stupid enough' to allow live software updates in production, etc., etc.  But, as I mentioned above, this was not a software update.  Indeed, CrowdStrike recommends that their customers run software versions n-1 or n-2 in production - not the latest release.  They recognize the risk of running a bleeding-edge software release in production.  This was a rules update meant to catch emergent threats, and that's exactly why you run a product like Falcon in the first place.  The point is to have rapidly updated malware detection to help block emergent threats, and that requires rapidly updating production systems - hence the Rapid Response Content.

If you were to stage these rules updates in a testing environment and run them for a while before deploying to production, that would defeat the purpose of a rapid update.  You would gut the product's ability to defend production systems against emergent threats.  That's the tradeoff - you're trusting a vendor to provide rapid updates to reduce your risk from emergent attacks, which means accepting some risk from updates to your production systems that you haven't tested yourself.  I'm sure most who were running Falcon believed a simple rule update carried very little risk compared to a software update.  I don't feel they deserve the kind of dismissive judgement that was going around; that's victim blaming.

Now that the unexpected has happened, the choice remains the same and the decision may not change.  Do you take the risk of not having updated protection?  Or do you take the risk that CrowdStrike will make the same mistake again?  There's a solid argument that, with the lessons learned from this incident, the risk is now lower than it was before.  I don't think there is a right or wrong choice; it will come down to each customer's own evaluation of the risks.

The cleanup was just the beginning.

The Coverage

The following are only a sample of the tsunami of coverage of this event - it isn't even all of the articles I read myself.  It is merely representative of the rapidly evolving situation and the constant flow of new and updated stories related to it.  And each of the articles is likely to link to further sources and coverage.  This rabbit hole runs deep.

7/19

7/20

7/21

7/22

7/23

7/24

7/25

7/26

7/27

7/29

7/30

8/1

As you can see, the news was continuing to roll in as I wrote this issue.  While the outage itself may be resolved, the fallout will continue for a long time to come.  There is a very real possibility this could end CrowdStrike if they lose the lawsuits being filed against them and are forced to pay substantial compensation.  That would come on top of the massive drop in their stock value, and the very likely loss of business as potential customers steer clear.

I am thankful not to have been involved in the critical incident this time, and I do not envy those who were in the thick of it.  We owe a lot of thanks to the folks who put in long hours cleaning up the mess as rapidly as they did.

That Was the Week That Was

Thank you for your time and attention this week.  I hope you found something of value in my ramblings.

As always, if this is your first TWIS, you can always read past editions.  I also encourage you to check out all of the content from the F5 SIRT.

Until next time, be well.

Published Aug 05, 2024
Version 1.0