
Onyx Network Status

Clear communication throughout service-affecting events

Some systems are experiencing issues

Past Incidents

22nd April 2021

No incidents reported

21st April 2021

Database server issues impacting site availability

We're investigating an issue with one of the Onyx database nodes which may impact site availability - we'll update as soon as we have further information.

Update 12:10 - As of 12:00, all Onyx sites should be back online. Many thanks for your patience, and apologies for the issues this morning. We're continuing to work on identifying the exact root cause and will confirm it here once known.

20th April 2021

No incidents reported

19th April 2021

No incidents reported

18th April 2021

No incidents reported

17th April 2021

No incidents reported

16th April 2021

Degraded performance for uncached requests

We're investigating an issue which may cause uncached requests to be served at a slower rate than usual. We're working to resolve this and will update as soon as we can.

Update 17:20: We're continuing to investigate this - as part of our investigation, we're currently restarting the Onyx storage nodes, which will temporarily impact site availability.

Update 17:51: We've isolated the issue and implemented a fix. We're continuing to monitor and will provide an update once this is confirmed as resolved. We will publish an RFO for this incident.

Update 18:23: This is confirmed as resolved.

Update 14:00 19/04: We've now published a full RFO for this incident, which you can view below:

Reason For Outage (RFO): Onyx performance issues - Friday, 16th April 2021

Timeline of Events:

At approximately 16:00, our internal monitoring systems flagged high CPU load averages on several compute nodes across separate clusters on the Onyx platform.

Our senior systems administration team, along with the support team leads for the platform, immediately began investigating but were unable to determine the exact nature of the problem. At around 16:15, it was observed that several sites on the Onyx platform were experiencing a large increase in waiting 'D-state' processes, which suggested a possible issue with the storage layer.
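
For context, 'D-state' (uninterruptible sleep) usually means a process is blocked waiting on I/O, so a sudden rise in D-state processes alongside climbing load averages points towards the storage layer rather than CPU-bound work. The sketch below is illustrative only - it is not the monitoring tooling used on the platform - and simply shows how these two signals can be read on a Linux host:

#!/usr/bin/env python3
# Illustrative sketch only - not the Onyx platform's actual monitoring tooling.
# Counts processes in uninterruptible sleep ("D" state, usually waiting on
# storage I/O) and prints the system load averages.

import os

def count_d_state():
    count = 0
    for pid in os.listdir("/proc"):
        if not pid.isdigit():
            continue
        try:
            with open(f"/proc/{pid}/stat") as f:
                # The field after the closing ")" in /proc/<pid>/stat is the
                # process state; "D" means uninterruptible (disk/storage) wait.
                state = f.read().rsplit(")", 1)[1].split()[0]
        except (OSError, IndexError):
            continue  # process exited (or stat was unreadable) mid-scan
        if state == "D":
            count += 1
    return count

def load_averages():
    # /proc/loadavg starts with the 1, 5 and 15 minute load averages.
    with open("/proc/loadavg") as f:
        return [float(x) for x in f.read().split()[:3]]

if __name__ == "__main__":
    print("D-state processes:", count_d_state())
    print("Load averages (1/5/15 min):", load_averages())

Seeing D-state counts rise across nodes together with load averages is the kind of signal that directed the investigation towards storage.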

At 16:28, we made contact with the vendor for our storage platform, VAST, to request assistance from their support team with this issue.

At around 17:00, having ruled out several other possible causes, we decided to restart the platform's load balancer; this completed at 17:11. The load balancer correctly failed over to a secondary server, so the restart caused no further disruption, but it also did not resolve the underlying issue.

At 17:12, we received confirmation from VAST that they were escalating this issue internally, having also been unable to determine the underlying cause.

At 17:16, the decision was taken to fully shut down and then systematically power back on all of the Onyx platform's compute nodes, to see if this would offer any further insight into the possible cause. This was completed at 17:22 and caused a full outage for all sites on the platform between these times. On reboot, the CPU load returned to the previous high levels within minutes.

At 17:28, a member of our systems administration team identified a possible causal link between an earlier maintenance operation on the storage platform and the high CPU load on the Onyx compute nodes. This was confirmed as the probable cause by a VAST engineer, and a member of their team immediately started working on resolving it.

At approximately 17:48, we were advised this work had been completed, and the systems administration team at Krystal confirmed that CPU loads were back under control, and that the issue was resolved.

Cause of Downtime:

An earlier maintenance operation on a trash directory on the storage platform (not directly linked to its use by Onyx) accidentally re-activated a tool within the storage functionality. Because other scripts related to this tool were still correctly disabled, the tool malfunctioned and was unable to complete its operations successfully. This increased wait times for storage access from the Onyx compute nodes, which in turn drove up CPU loads on those nodes.

Once the cause was identified, the tool was disabled and the storage nodes were rebooted (in a way that allowed for failover, so this caused no further outage), which fully resolved the issue.

Summary of Impact:

Aside from the period during which we systematically rebooted the compute nodes as part of the debugging process, cached requests were still being served directly from the load balancer, and storage queries continued to work, albeit slowly at times. The impact was therefore relatively low for sites that were being fully cached.
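
As an aside, you can get a rough idea of whether a given page is being served from a cache in front of the origin by inspecting its response headers. The header names checked below (X-Cache, Age and similar) are common conventions rather than a statement of what the Onyx load balancer actually exposes - this is purely an illustrative sketch:

#!/usr/bin/env python3
# Illustrative only: look for cache-related response headers on a URL.
# The header names are common conventions and may not match what any
# particular platform (including Onyx) actually sets.

import sys
import urllib.request

CACHE_HINT_HEADERS = ("X-Cache", "X-Cache-Status", "Age", "CF-Cache-Status")

def cache_headers(url):
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req, timeout=10) as resp:
        return {h: resp.headers[h] for h in CACHE_HINT_HEADERS if resp.headers[h]}

if __name__ == "__main__":
    url = sys.argv[1] if len(sys.argv) > 1 else "https://example.com/"
    print(cache_headers(url) or "No cache-related headers found")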

Improvements:

Working with our storage vendor, VAST, we will implement a long-term fix to prevent this issue from recurring.

Additionally, through ongoing review of our Change Control processes, we're working to reduce the time it takes to diagnose the cause when an issue does occasionally occur, particularly when the cause is a separate change made before the issue appeared.

I would like to offer my unreserved apologies for any inconvenience caused.

Dave Kimberley
Director & Chief Operating Officer, Krystal Hosting Ltd