OT Incident Response: Is it Mission Impossible?
What Should You Do in the Event of a Cybersecurity Incident in an Industrial Environment?
The inevitability of cyberattacks in this day and age means that industrial facilities around the world are re-prioritizing their investments in industrial cybersecurity. As priorities shift, the biggest need is to increase spending on strategies and tools for incident detection and response, relative to traditional tactics of protection and prevention.
In other words, it is now well understood that it is no longer enough to build a typical network security perimeter and hope that it keeps attackers out. Procedures must be designed to deal with attacks, and organizations must be ready to mitigate and repair any damage once someone has got in.
That message seems to be getting through in the world of enterprise IT, where the scale of the problem has become clear thanks to disclosure laws and public awareness of new regulations. However, when it comes to operational technology (OT), many organizations are still struggling to adjust, and the old 80/20 spending rule – that 80% of spend is on prevent and protect, and just 20% on detect and respond – still holds true.
There are signs that this will change. For providers of critical infrastructure in oil, water, manufacturing, power, gas, transport and healthcare, failure to adequately prepare for an incident could mean falling foul of new regulations, such as the European Union’s (EU) Directive on Security of Network and Information Systems (NIS Directive). Compliance is just one motivating factor, however. A methodical and well-structured plan to deal with incidents will improve security and help to prevent financial or reputational damage too.
One key challenge for incident detection and response in industrial OT environments is that there is a far lower level of homogeneity than within the typical IT world. Windows malware can run rampant and be devastating, but the skills needed to troubleshoot it are relatively widely available.
A successful attack on OT, meanwhile, could impact numerous different systems from different vendors, and understanding the appropriate response in the face of this complexity requires highly specialized skills and involvement from various parties, including engineering, vendors, system integrators and more. As OT systems tend not to be configured for comprehensive logging or long-term data retention, the forensic toolkit for isolating problems is very different from the world of IT, and forensic investigation is sometimes impossible altogether.
Adding to the difficulty is the fact that OT environments are at risk of both generalized malware and very sophisticated, targeted threats – and failure to take appropriate action could lead to devastating results that impact physical processes. The Triton malware, for example, was designed – at great apparent expense – to target a Safety Instrumented System (SIS) of the type used in power stations around the world. If it hadn’t inadvertently triggered that SIS, it might have remained undetected until it delivered its final payload, and while we may never know what the goal of its creators was, it certainly had the potential to cause mass disruption in a very dangerous environment, potentially resulting in loss of life.
Triton and other OT malware such as Stuxnet, Duqu and Shamoon all infected systems without being detected for months. Significant state-level resources were mobilized to deal with and investigate these events. It would be naïve to assume that there are no other threats like these waiting to be found.
So, what should you do in the event of a cybersecurity incident in an industrial environment? Best practice guidelines for detect and respond policies follow a well-established seven-step process: prepare, identify, contain, eradicate, restore, learn, test and repeat.
First, the key word in an incident plan is not “incident”. Preparation is everything, and this means a thorough risk assessment must be in place, addressing everything from staff training to contact lists for key personnel in the event of an incident. It means more than knowing whom to contact; it must take into account any potential difficulty in actually reaching them.
Contingencies for an incident which knocks out communications, creates a hazardous environment or takes place in a remote site – such as an oil rig – must be in place and regularly updated. Ensuring external parties are part of the planning stage is vital too, as is keeping contractual obligations up to date.
The second step involves breach identification, and it’s here that many organizations struggle. The ability to spot unusual behaviors and correctly classify them is vital to taking appropriate action. Many of the penetration tests that we conduct are successful, indicating that much work needs to be done in this area. The good news is that tools do exist that provide heuristic monitoring and early warnings that an attack may be underway.
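To make the idea of heuristic monitoring concrete, here is a minimal sketch of one common technique: learning a baseline of per-device traffic rates and flagging sharp deviations or previously unseen hosts. All device names and thresholds are hypothetical, and real OT monitoring products use far richer, protocol-aware analysis; this only illustrates the principle of baselining.

```python
from statistics import mean, stdev

def build_baseline(history):
    """history: {device_id: [messages-per-minute samples]} -> {device_id: (mean, stdev)}."""
    return {dev: (mean(samples), stdev(samples)) for dev, samples in history.items()}

def flag_anomalies(baseline, current, threshold=3.0):
    """Return (device, rate, reason) for rates far outside the baseline or unknown hosts."""
    alerts = []
    for dev, rate in current.items():
        if dev not in baseline:
            # A device never seen before is itself suspicious on a static OT network.
            alerts.append((dev, rate, "unknown device"))
            continue
        mu, sigma = baseline[dev]
        if sigma > 0 and abs(rate - mu) / sigma > threshold:
            alerts.append((dev, rate, "rate anomaly"))
    return alerts

# Hypothetical baseline data for a PLC and an HMI workstation.
history = {"plc-01": [100, 102, 98, 101, 99], "hmi-01": [20, 22, 19, 21, 20]}
baseline = build_baseline(history)
alerts = flag_anomalies(baseline, {"plc-01": 250, "hmi-01": 21, "eng-99": 500})
```

In this run, `plc-01` is flagged for a rate spike and `eng-99` as an unknown host, while `hmi-01` stays within its baseline. The appeal of this approach in OT is that industrial traffic patterns are usually stable and repetitive, so even a simple statistical baseline can surface early warnings.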
Once an issue is confirmed, it’s important to understand the nature of an event and its potential to cause damage. Filtering out false positives requires experience and high degrees of technical skill.
Next comes containment, which again requires protocols laying out appropriate courses of action; these can be vital to keeping heads cool during a crisis. Over-reaction could be just as damaging to operations as under-reaction. Can the threat be contained simply by disconnecting one network host, or isolating a section of the production line? Is there a plan in place for segregating the OT network if malware is discovered on the corporate network? The right strategy will prevent unnecessary downtime, and also make forensic investigation easier if data about system states can be preserved.
It also goes without saying that reputational damage control is important. Being able to communicate with knowledge and transparency in a time of crisis is vital to avoid compounding the problem.
Eradication and restoration
The fourth and fifth steps involve eradicating the threat and bringing the environment back online as quickly as possible, in an ideal world using a well-documented process for restoring from trustworthy “golden image” backups.
Backup and restoration, however, present certain difficulties. One key issue is the challenge of regularly testing this capability and the backups themselves. Stopping a production line for comprehensive drills is difficult, while maintaining a replica environment for testing is prohibitively expensive for most. Technologies like virtualization, which have been widely adopted by the industry, can provide the required flexibility and assurance.
Learning and reiterating
And finally, steps six and seven emphasize the need to document and learn from every event in order to identify weaknesses and prevent recurrence. Then fine-tune and test your processes, train your staff with attack simulations, drills and games, and do it all again and again.
The key word that we keep coming back to, however, is skills. An incident response plan can only be as effective as the people who create it and put it into action, and these skills need to be cultivated against a backdrop of chronic past underinvestment. Organizations will have to draw on external partners who can bring cross-sector OT experience to the challenge – and it’s better to anticipate problems together as partners than to engage as a post-attack victim.