How to minimise the risk of data centre outages

Whether they operate a private or a co-location facility, facility managers should always be looking for ways to prevent data centre outages. Today’s data centres have more tools than ever to safeguard their infrastructure and maintain their systems, which helps deliver higher server uptime.

However, a data centre is far more likely to operate at its peak performance levels if it adheres to high-quality operational procedures supported by a robust maintenance and support program.

For all of the emphasis on server stability, the Uptime Institute has found that human error is one of the most common causes of data centre downtime.

In reality, many of the most high-profile data centre outages suffered by large enterprises in recent years can be linked to carelessness or negligence. For example, thousands of websites, including Amazon, Twitter and Spotify, were knocked offline in multiple countries for nearly an hour because of degraded service at a single content delivery network, Fastly.

Other failures can be measured directly in terms of financial loss. The TSB outage resulted in the firm paying more than £370 million in compensation, and downtime at British Airways stranded 75,000 passengers and left the airline with a £150 million liability.

Other elements, such as cooling and cabling, also contribute to the problem. Networking is the foundation of any high-performance data centre, so a cable failure can put the entire facility at risk.

Regular checkups are necessary to ensure that cables are not compressed too tightly or bent too sharply. In addition, poorly constructed cable that performs badly or suffers from near-end crosstalk can severely reduce a data centre’s efficiency.

Overheating can also cause a data centre to fail. This occurs when equipment becomes too hot and must be shut down to avoid damage. Overheating has several possible causes: insufficient cold air entering the cold aisle of a cold-aisle containment system, too little airflow through the cabinets, or a loss of redundancy in the cooling system.
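The loss of cooling redundancy mentioned above can be checked with simple arithmetic. The following is a minimal sketch, assuming each cooling unit has a known capacity in kW and the room has a known heat load; the function name and the figures in the usage lines are hypothetical and chosen only for illustration.

```python
def survives_single_failure(unit_capacities_kw, heat_load_kw):
    """Return True if the heat load is still covered after the largest unit fails.

    This is a simple N+1 style check: remove the biggest cooling unit and
    compare what is left against the load. Figures are assumed to be in kW.
    """
    if not unit_capacities_kw:
        return False  # no cooling at all cannot cover any load
    remaining_kw = sum(unit_capacities_kw) - max(unit_capacities_kw)
    return remaining_kw >= heat_load_kw


# Hypothetical room: three 50 kW units serving a 100 kW load.
redundant = survives_single_failure([50.0, 50.0, 50.0], 100.0)
# The same load with only two units left in service.
degraded = survives_single_failure([50.0, 50.0], 100.0)
```

Running a check like this after every maintenance change makes it easier to notice when a unit taken out of service has quietly removed the safety margin.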

To reduce the risk of a cooling failure, check all cooling equipment regularly to confirm that it is functioning as expected. In addition, we recommend purchasing an environmental monitoring system that will notify you when temperatures start to drift out of range. Even the most basic error can lead to a significant amount of downtime, which is why data centres should carry out a risk assessment regularly to reduce the possibility of failure.
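To illustrate what an environmental monitoring system does at its core, here is a minimal sketch of threshold-based temperature alerting. The sensor names, the `Reading` type, and the 27 °C and 32 °C thresholds are assumptions made for this example, not values from any particular product; real thresholds should come from the equipment vendor and the facility's own policy.

```python
from dataclasses import dataclass

# Assumed thresholds in degrees Celsius, for illustration only.
WARNING_C = 27.0
CRITICAL_C = 32.0


@dataclass
class Reading:
    sensor: str      # hypothetical sensor label, e.g. a rack inlet probe
    celsius: float   # measured inlet temperature


def classify(reading):
    """Map a temperature reading to 'ok', 'warning' or 'critical'."""
    if reading.celsius >= CRITICAL_C:
        return "critical"
    if reading.celsius >= WARNING_C:
        return "warning"
    return "ok"


def find_alerts(readings):
    """Return (sensor, level) pairs for every reading that is not 'ok'."""
    alerts = []
    for reading in readings:
        level = classify(reading)
        if level != "ok":
            alerts.append((reading.sensor, level))
    return alerts
```

A commercial system adds notification channels, trending, and humidity or leak sensors, but the underlying rule is the same: compare each reading with an agreed limit and raise an alert before equipment has to shut down.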

Here are a few of the most frequent blunders:

  • Activating the emergency power-off (EPO) switch
  • Changing the temperature scale from Fahrenheit to Centigrade
  • Removing power cords from the equipment
  • Overloading a circuit
  • Failing to follow standard processes, procedures, or techniques
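Some of these blunders can be caught with a simple check before a change is made. As one example, the sketch below tests whether adding equipment would overload a circuit. The 80% continuous-load limit is an assumption used here as a common rule of thumb; the correct limit depends on local electrical codes and the breaker in use, and the function names are hypothetical.

```python
def load_fraction(device_amps, breaker_amps):
    """Return the total device current as a fraction of the breaker rating."""
    if breaker_amps <= 0:
        raise ValueError("breaker rating must be positive")
    return sum(device_amps) / breaker_amps


def is_overloaded(device_amps, breaker_amps, max_fraction=0.8):
    """Return True if the combined load exceeds the allowed fraction.

    max_fraction=0.8 is an assumed derating limit, not a universal rule.
    """
    return load_fraction(device_amps, breaker_amps) > max_fraction


# Hypothetical 16 A circuit with three devices already connected.
existing = [4.0, 5.0, 3.5]
safe_now = not is_overloaded(existing, 16.0)
# Check the effect of plugging in one more 1 A device before doing it.
safe_after_change = not is_overloaded(existing + [1.0], 16.0)
```

Building a check like this into the change procedure turns "do not overload a circuit" from a reminder into a step that has to be passed.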

With today’s sophisticated data centre infrastructure management (DCIM) tools, facilities can track the overall health of their equipment and co-located assets. While it may be tough to anticipate every failure, sophisticated algorithms can continuously monitor equipment performance to predict when the hardware is approaching the end of its life cycle or is likely to break down.
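As a rough illustration of this kind of prediction, the sketch below fits a straight line to a series of daily measurements and estimates how many days remain before the metric crosses a limit. This is a deliberately simple stand-in, not the algorithm used by any specific DCIM product; the metric, the threshold, and the function names are assumptions for the example.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two matching data points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_threshold(days, values, threshold):
    """Estimate days from the last sample until the trend reaches threshold.

    Returns None when the trend is flat or falling, because the line
    never reaches the threshold in that case.
    """
    slope, intercept = fit_line(days, values)
    if slope <= 0:
        return None
    crossing_day = (threshold - intercept) / slope
    return max(0.0, crossing_day - days[-1])


# Hypothetical fan-bearing temperature (deg C) sampled once per day.
sample_days = [0, 1, 2, 3]
sample_temps = [40.0, 42.0, 44.0, 46.0]
remaining = days_until_threshold(sample_days, sample_temps, 60.0)
```

Real tools combine many signals and more robust models, but even a linear trend gives staff a date to plan a replacement around rather than waiting for the failure.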

When these issues are discovered, data centre employees can prepare to replace outdated equipment without interrupting crucial systems. Furthermore, even the most unexpected outage may be managed without degrading network performance when appropriate redundancies and backups are in place.

To minimise these and other risks, data centre executives should make sure that whenever they refurbish, relocate, or build a data centre, they invest in employee training, engagement, and documentation, define responsibilities clearly, and assign specific tasks to teams.

Only by employing a set of high-quality operational procedures, backed by comprehensive support and a maintenance program, can a data centre reach its maximum potential with little risk of an outage. Keep in mind that ‘prevention is far superior to cure’.
