Discussion about this post

Neural Foundry

Really great breakdown of these incidents. The CrowdStrike one hit home because it shows how a single missing null pointer check in kernel mode can cascade into a global disaster. What strikes me is that all three cases share the same underlying issue: speed was prioritized over safety checks. CrowdStrike bypassed Windows certification, Google replicated bad data globally in seconds, and AWS had no validation on their DNS system. I think the biggest takeaway is that we need to treat configuration changes with the same rigor as code changes. Too many teams still think configs are less risky than actual code, but as you showed here, a bad config file can be just as devastating as a bad code deploy. The gradual rollout point is huge too: there is no excuse for pushing changes to 100% of systems simultaneously anymore.
