Post Mortem details for the Major Incident affecting Workflows on 28/06/2024
Incident summary: Following the Alloy v2.60 release, some customer Workflows failed to trigger, both overnight and during the day. As a result, some scheduled tasks were not generated for customers.
Root cause: The release included a new memory optimisation feature that contained a bug. The bug caused some Workflows above a certain size limit to fail because they were treated as poisonous. These were then removed and deleted from the environment.
The duration of the Incident was approximately 8 hours in total.
Resolution and recovery: We began investigating immediately after being notified that some Workflows had failed. We quickly identified the new feature that contained the bug and built a critical patch to back out the change. The patch was fully tested and then released to all production environments. We then monitored the environment for several hours and saw no further issues.
Corrective and preventative measures: Following this Incident, we have carried out a full root cause analysis and internal review. While our releases are always rigorously tested, we have identified additional, improved performance tests that will help us detect similar problems before release. We will include these tests in our standard testing processes going forward, wherever they are required.