Incident Postmortem: Message Broker Failure
Incident Report for Ymonitor
Resolved
On 23 August 2023, a failure in the message broker backend prevented measurements from being processed between 21:43 and 23:41 CET. During this window, dashboards could not refresh their data, and no alerts were created for measurements that would otherwise have triggered them.

Root Cause:

The message broker backend crashed after receiving unusual incoming data. While the broker was down, it could not process any measurements.

Impact:

Users were unable to view real-time data on the dashboards, and no alerts were created for measurements that would have triggered them. As a result, users may not have been notified of potential problems.

Mitigation:

After investigating, we determined that the data causing the blockage was not needed, and we mitigated the issue by discarding the abnormal messages from the broker. All pending measurement data was then saved to the database, and dashboards resumed normal operation. We are still investigating the details of the abnormal data that caused this incident and how to prevent a recurrence.
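
For readers running similar pipelines, the manual mitigation described above (discarding messages that cannot be processed so the valid backlog can drain) might be automated roughly as follows. This is a minimal sketch and not our actual implementation: the JSON message format, the field names in REQUIRED_FIELDS, and the functions parse_measurement and drain are all hypothetical, and an in-memory deque stands in for the real broker. Malformed messages are set aside with a reason instead of being silently dropped, so they can still be inspected afterwards.

```python
import json
from collections import deque

# Hypothetical schema: the real measurement format is not described in this report.
REQUIRED_FIELDS = {"monitor_id", "timestamp", "value"}


def parse_measurement(raw: bytes) -> dict:
    """Decode and validate one message; raise ValueError if it is malformed."""
    try:
        payload = json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError) as exc:
        raise ValueError(f"undecodable message: {exc}") from exc
    if not isinstance(payload, dict):
        raise ValueError("message is not a JSON object")
    missing = REQUIRED_FIELDS - payload.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    value = payload["value"]
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise ValueError("value is not numeric")
    return payload


def drain(queue: deque, store: list, dead_letters: list) -> tuple:
    """Process every queued message; set bad ones aside instead of blocking."""
    while queue:
        raw = queue.popleft()
        try:
            store.append(parse_measurement(raw))
        except ValueError as exc:
            # Keep the raw bytes and the reason for later investigation.
            dead_letters.append((raw, str(exc)))
    return len(store), len(dead_letters)


if __name__ == "__main__":
    backlog = deque([
        b'{"monitor_id": 1, "timestamp": 1692819780, "value": 0.42}',
        b"\xff\xfe",  # not valid UTF-8
        b'{"monitor_id": 2}',  # missing fields
    ])
    saved, rejected = [], []
    print(drain(backlog, saved, rejected))
```

The design choice in this sketch is that a single bad message costs only itself: the consumer records it and continues, so one malformed message cannot hold up the measurements queued behind it.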

We apologize for any inconvenience this incident may have caused. We appreciate your patience and understanding as we work to improve our systems.
Posted Aug 23, 2023 - 21:30 CEST