Bend, don’t break: build in resilience

Manufacturers must have computer systems that work. When talking about making systems that keep working, we often talk about “hardening” the system, but that can be misleading. Some things are hard but brittle, in that they can resist light pressure but shatter under a hard shock. What you really want are systems that are resilient—able to absorb shocks and bounce back without breaking. That sounds good, but how do you translate “resilient” into an operational computer system?

Continuous operation

In most circumstances, continuous operation translates to “fail-over”: a redundant component waiting to take over instantly if something goes wrong with the primary component. Depending on your organization’s needs and budget, this can range from an equipment rack with duplicate hardware sitting on the other side of the datacenter to a complete mirror datacenter in a remote location.

In both cases, the key to redundancy is a controller that mirrors transactions to the fail-over system and switches to the redundant system if it detects specific conditions on the primary device. Many executives see redundant hardware as an unnecessary cost, but compared with the cost of a catastrophic failure that takes a critical system offline, redundancy can be remarkably cost-effective.
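The controller's two jobs, mirroring and switching, can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the Node and FailoverController classes, the max_failures threshold, and the "consecutive failed health checks" trigger are all hypothetical names and choices made for this example, not any vendor's product or API.

```python
class Node:
    """Hypothetical stand-in for a primary or standby system."""

    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.log = []  # transactions this node has applied

    def apply(self, txn):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        self.log.append(txn)


class FailoverController:
    """Mirrors each transaction and promotes the standby on repeated failures."""

    def __init__(self, primary, standby, max_failures=3):
        self.active = primary
        self.standby = standby
        self.max_failures = max_failures  # assumed switch-over condition
        self.failures = 0

    def write(self, txn):
        # Apply to the active node, then mirror to the standby if one remains.
        self.active.apply(txn)
        if self.standby is not None:
            self.standby.apply(txn)

    def health_check(self):
        # A healthy check resets the counter; enough consecutive failed
        # checks promote the standby to active.
        if self.active.healthy:
            self.failures = 0
            return
        self.failures += 1
        if self.failures >= self.max_failures and self.standby is not None:
            self.active, self.standby = self.standby, None
            self.failures = 0
```

Because every transaction was mirrored before the failure, the standby already holds the same log when it is promoted. A real controller would also have to handle writes that arrive during the switch and bring a replacement standby online afterward.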

Criminal attacks

Next on the resilience list is defense against criminal attacks. You’ve heard all the stories, but far too many IT executives in manufacturing still feel that no one would want to attack their operations. Among those who do think about security, some still see hiding resources (“security through obscurity”) as an effective strategy. They’re wrong. Active security that addresses not only nuisance activity such as port probes but also advanced persistent threats and social engineering is absolutely required if your systems are going to be there every time the company needs them.

Planning for failure

Resilience means planning for and being ready for failure.

Planning begins with an effective backup and recovery strategy that includes practicing restore operations on a regular basis. Too many companies have placed their faith in a backup and restore regime that looked sound on paper but failed when it was needed because small issues were never caught and fixed.
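A restore drill can be automated so that it actually happens on a schedule. The sketch below is one minimal way to do it in Python, assuming the data to protect is a directory of files: it archives the directory, restores the archive into a scratch location, and compares checksums of every file. The function names checksums and restore_drill are made up for this example, and a production drill would run against your real backup tool and media rather than a freshly made zip file.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def checksums(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def restore_drill(source_dir):
    """Back up source_dir, restore it elsewhere, and verify the contents match."""
    with tempfile.TemporaryDirectory() as scratch:
        # Step 1: take the backup (a zip archive stands in for real media).
        archive = shutil.make_archive(
            str(Path(scratch) / "backup"), "zip", root_dir=source_dir
        )
        # Step 2: restore into a separate, empty location.
        restored = Path(scratch) / "restored"
        restored.mkdir()
        shutil.unpack_archive(archive, restored)
        # Step 3: the drill passes only if every file came back intact.
        return checksums(source_dir) == checksums(restored)
```

Running restore_drill on a schedule and alerting when it returns False is the kind of practice that catches small issues before a real recovery depends on them.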

Resilience can also include offsite storage of backup media and an offsite datacenter that can be brought online from that media. The time to get back online might be measured in hours rather than seconds, but for many critical functions, such as inventory, accounts receivable, or payroll, that is adequate, and far better than being offline for days, as can happen after a natural disaster.

Hardening a system might mean trying to ensure that no external event would have an impact on it. Resilience recognizes that the outside world exists and will have an impact but designs systems that can absorb the shock and continue operating. Design for resilience and you have a much better chance at maintaining your manufacturing operations, regardless of what the outside world might bring to bear on your IT systems.
