The ability to control vast infrastructures has always been critically important. From ancient civilizations to modern computer games, staying in control of key assets and resources has been the key to winning, and losing that control has led to the demise of empires and to battles lost to more agile rivals. The theme has been illustrated countless times in history books and movies, culminating in my personal favorite, “The Matrix”, where Neo struggles to tell the real world from the virtual one controlled by extremely advanced machines.
The weak link
In the early days of the Internet, networks and servers were controlled by human operators, either manually or through rudimentary network management systems such as CiscoWorks and HP OpenView. Neither method produced good results, and large parts of the network and server infrastructure were left in darkness and disarray. These were the dark ages of networking, when it took ‘ages’ to roll out a configuration change or to notice critical omissions and costly disturbances. Most importantly, manual operator control carries the highest risk of all: human error. The human factor is commonly estimated to contribute to 80-90% of all incidents, a figure that is remarkably consistent across industries. In aviation, where regulations demand rigorous investigation of every incident, the human factor contribution remains at this steady high mark despite enormous efforts to reduce risk and error. This is because we humans are the weak link in complex automation chains, and always will be unless some dramatic, revolutionary improvement is made.
In modern networks, automation has evolved from the simple scripts and rigid provisioning systems of the early days into powerful and flexible frameworks that allow operators to control thousands of devices and choreograph virtual servers in complex mass performances. Surprisingly, the weakest element that plagued early networks still haunts critical infrastructure today, and at unprecedented scale. Where a lack of knowledge, discipline or coordination once affected a single device or system, it can now affect very large portions of the infrastructure in an instant. I have personally dealt with the aftermath of wrong template deployments, rushed network-wide software updates and the proliferation of corrupt configurations, which crippled networks for days and, in extreme cases, many months.
Many times it has occurred to me how convenient it would be to see the real picture behind the countless lines of logs and telemetry in real time, much like Neo, who was able to see the Matrix through the streaming debug output on the rebels' monitors.
Hold that thought until our next blog post in a couple of weeks.
CEO Opt/Net Consulting BV