Workflows not triggering
Incident Report for Alloy
Postmortem

Post Mortem details for the Major Incident affecting Workflows on 28/06/2024

Incident summary: Following the Alloy v2.60 release, some customer Workflows failed to trigger, both overnight and during the day. As a result, some scheduled tasks were not generated for customers.

Root cause: The release included a new memory optimisation feature that contained a bug. The bug caused some Workflows over a certain size limit to be treated as poisonous and fail. These were then deleted from the environment.

The duration of the Incident was approximately 8 hours in total.

Resolution and recovery: We began investigating as soon as we were notified that some Workflows had failed. We quickly identified the new feature that contained the bug and built a critical patch to back out the change. The patch was fully tested and then released to all production environments. We then monitored the environment for several hours and saw no further issues.

Corrective and preventative measures: Following this Incident, we have carried out a full root cause analysis and internal review. While our releases are always rigorously tested, we have identified additional, improved performance tests that will help us to detect problems such as this one. We will include these tests in our standard testing processes going forward, wherever they are required.

Posted Jul 04, 2024 - 18:26 UTC

Resolved
Workflows appear to be processing as normal, following the patch.
A full Post Mortem will be added to the Status Page within 5 working days.
Posted Jun 28, 2024 - 16:01 UTC
Monitoring
We have released a critical patch to Alloy to resolve this issue: v2.60.6 is now available for the Web & Engine.
No downtime is required. However, some users may notice slower system speed for the Web while the update processes.
We will monitor the fix closely, and update further as soon as we can.
Posted Jun 28, 2024 - 13:39 UTC
Identified
We have identified the root cause of the issue, and we are now building a fix. This is expected to take about an hour to build and fully test. We will provide further updates as soon as we can.
Posted Jun 28, 2024 - 11:10 UTC
Investigating
We have identified an issue with Workflows not triggering and are investigating.

We will provide further updates as soon as possible.
Posted Jun 28, 2024 - 10:09 UTC
This incident affected: UK (Task Executor).