Measurement data not available
Incident Report for Ymonitor
Postmortem

Dear Customers,

Ymonitor services have recently experienced outages. Two similar incidents occurred, on October 23rd and November 6th. Because both incidents happened during the night, they affected a small number of customers. We apologize for any inconvenience they may have caused. Below we explain in more detail what happened and how we will prevent issues like these in the future.

Cause

The outages were caused by problems at our service provider, AWS. We use an Amazon MQ service to process measurement data, create alerts, and send them in real time. The same service is also needed to save measurement data to our databases. In both incidents, AWS shut down our Amazon MQ service without prior notice and for no apparent reason. We are still in contact with their support teams to understand why these shutdowns occurred. If we receive more information from them, we will share it with you.

Impact

During the incidents, the sentinels could not send measurement data to Ymonitor. As a result, no alerts could be created, and real-time measurement data could not be displayed on dashboards or in Yviewer. After AWS put the service back online, the sentinels sent the unsent measurement data and resumed sending real-time data normally. It took another hour for the system to ingest all recovered data.

Corrective and preventive actions

After the first incident on October 23rd, we implemented hotfixes to prevent the loss of measurement data. Unfortunately, preventing the loss of real-time alerts requires more work. We are working on a permanent solution that lets us recover from an unexpected shutdown of the Amazon MQ service within a reasonable time, with the aim of minimizing the risk of losing alerts in a similar incident. Once the implementation is finished, we will most likely announce an emergency maintenance window and deploy it as quickly as possible to minimize service disruption.

If you have further questions, please do not hesitate to reach out to your consultant.

Kind regards,

Sentia part of Accenture

Posted Nov 07, 2022 - 15:22 CET

Resolved
All systems are operating normally again.
Posted Nov 07, 2022 - 00:45 CET
Monitoring
The service provider has put the service back online. Sentinels can send measurements again. If any data is missing between the start of the incident and now, we expect the sentinels to recover it within an hour. Any alerts that occurred during this period might be lost, or their notifications might arrive late.
Posted Nov 07, 2022 - 00:23 CET
Update
We are continuing to investigate this issue.
Posted Nov 06, 2022 - 23:47 CET
Update
We are currently experiencing problems with some of our provider's services. At this moment, Ymonitor cannot ingest some measurement data, so some alerts cannot be created.
Posted Nov 06, 2022 - 23:11 CET
Investigating
We are currently investigating this issue.
Posted Nov 06, 2022 - 23:08 CET
This incident affected: Ymonitor Dashboards, ymonitor.nl, API, Measurement Data Storage, Alerting, and YGate API.