Post-mortem details for the Major Incident affecting Workflows on 8 April 2024 (08/04/2024).
Incident summary: Customer Workflows failed to trigger because the Task Executor failed.
Root cause: The problem started on Saturday 6th April. The Task Executor failed due to high-frequency layer build errors. By Sunday this had escalated to 100% memory usage, which then caused some tasks to fail.
Resolution and recovery: We identified the root cause and resolved it via an emergency patch on Monday 8th April. All Workflows that did not trigger were then requeued for processing. A permanent fix is in progress; however, the patch applied on Monday fully resolved the Major Incident.
Corrective and preventative measures: We have created additional alerts that will notify us if this issue occurs again; this will allow us to resolve any memory problems quickly, before the threshold is breached and tasks start to fail.
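Illustrative example: the sketch below shows the general idea behind an early-warning memory alert, where a warning fires at a lower level than the point at which tasks start to fail. It is not our production alerting. The function name and both threshold values are hypothetical and chosen only for illustration.

```python
# Hypothetical thresholds for illustration only; the real alert levels
# are not stated in this report.
WARN_PERCENT = 80.0      # early warning, leaves time to intervene
CRITICAL_PERCENT = 95.0  # close to the point where tasks may start to fail


def classify_memory_usage(used_percent: float,
                          warn: float = WARN_PERCENT,
                          critical: float = CRITICAL_PERCENT) -> str:
    """Return 'ok', 'warning' or 'critical' for a memory usage percentage."""
    if not 0.0 <= used_percent <= 100.0:
        raise ValueError("used_percent must be between 0 and 100")
    if warn >= critical:
        raise ValueError("warn threshold must be below critical threshold")
    if used_percent >= critical:
        return "critical"
    if used_percent >= warn:
        return "warning"
    return "ok"
```

In a check like this, a "warning" result would notify the on-call team while there is still headroom, and "critical" would indicate that tasks are at imminent risk.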