Hey there!
After the biggest IT outage in history last Friday morning, I didn’t want to miss the opportunity to share some thoughts on it and bring some key lessons to the table. Besides this outage being the largest we’ve ever seen, I can’t stop thinking about how tightly legacy systems are interconnected with newer layers of services, and about the importance of having a QA protocol for every integration that covers as many scenarios as possible. To be honest with you, after 15 years working in tech, it surprises me that this kind of thing doesn’t happen more often. I’ve seen more than once how major updates are launched straight to a production environment. And there is no way to create a sandbox that can reproduce every possible scenario. But then, where should we leverage AI and automation, and when should we bet on human supervision?
Automation and AI are the latest trend in tech, but they have been around for a while. We have been chasing the promise of abundance and scalability for a long time. The idea of making our lives easier, our processes more efficient, and our software more secure is super sexy. But as we saw with the recent CrowdStrike outage, over-relying on automated systems that are connected to legacy systems, without proper human supervision, can lead to some serious trouble.
Let's break down what happened and what we as organizations can learn to avoid similar issues when deploying large-scale updates:
1. Automation Isn't Infallible
While AI and automated systems are powerful, they’re not flawless. The CrowdStrike outage was triggered by a logic error in a sensor configuration update that was automatically deployed. Even the most advanced algorithms can make mistakes, and without human oversight, those mistakes can quickly spiral out of control. When you automate at that scale, the consequences of a mistake play out at that same scale.
2. Context Is Key
Humans have a unique ability to understand context in ways machines just can’t. AI can crunch numbers and spot patterns, but it often lacks the nuanced understanding that comes from human experience. When it comes to deploying critical updates, it’s essential to have people involved who can interpret data through the lens of organizational goals and who can map each major use case to its potential risks.
3. Adapt and Overcome
The cybersecurity landscape is constantly evolving, with new threats emerging all the time. Humans are inherently more adaptable than machines; we can pivot strategies and approaches based on changing circumstances. Automated systems, on the other hand, are limited to the parameters they’re given. Maintaining a human element in the update process allows for that necessary flexibility.
4. Collaboration Is Key
Integrating human oversight into automated processes isn’t about replacing AI—it’s about creating a collaborative partnership. Encourage your teams to work alongside these automated systems, understand their capabilities, and know when to step in. Invest in training that equips employees with the skills to manage and oversee large-scale automations effectively.
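To make "know when to step in" a bit more concrete, here is a minimal sketch of a staged rollout with a human approval gate. It is only an illustration of the idea, not how CrowdStrike or any specific vendor deploys updates. Every name in it (Stage, rollout, deploy, is_healthy, approve) is hypothetical, and the callbacks stand in for whatever deployment tooling, health checks, and sign-off process your organization already has.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Stage:
    """One step of a rollout (hypothetical structure for this sketch)."""
    name: str
    hosts: Sequence[str]
    needs_human_approval: bool


def rollout(
    stages: Sequence[Stage],
    deploy: Callable[[str], None],       # assumed: pushes the update to one host
    is_healthy: Callable[[str], bool],   # assumed: health check after the update
    approve: Callable[[Stage], bool],    # assumed: a person signs off on the stage
    max_failure_rate: float = 0.0,
) -> List[str]:
    """Deploy stage by stage and return the names of the stages that completed.

    The rollout stops as soon as a human declines a stage or a stage's
    failure rate exceeds max_failure_rate, so later stages are never touched.
    """
    completed: List[str] = []
    for stage in stages:
        # Human gate: automation proposes, a person decides.
        if stage.needs_human_approval and not approve(stage):
            break
        for host in stage.hosts:
            deploy(host)
        # Automated gate: check the hosts we just updated before going wider.
        failures = sum(1 for host in stage.hosts if not is_healthy(host))
        if stage.hosts and failures / len(stage.hosts) > max_failure_rate:
            break
        completed.append(stage.name)
    return completed
```

You would call it with an ordered list such as a test environment first, then a small canary group, then the full fleet, with needs_human_approval set on the wider stages. The point of the design is that a bad update is caught while the blast radius is still small, and that a person confirms each widening step instead of the pipeline pushing everywhere at once.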
The CrowdStrike incident serves as a wake-up call. While the allure of automation is strong, we can’t lose sight of the value that human judgment brings to the table. By embracing a balanced approach that combines the strengths of both technology and human oversight, organizations can navigate the complexities of this paradigm shift with confidence and resilience.
So, the next time you’re rolling out a major update, remember that automation is powerful, but it’s not all-powerful. Keep those humans in the loop, maintain robust review processes, please, please use test environments, and don’t be afraid to adapt as needed. Memes are a great part of these historic moments, but I am sure nobody wants their brand or their face in one of them.