Dear Customers,
We have recently experienced outages on Ymonitor services, with two similar incidents on October 23rd and November 6th. As the incidents happened during the night, they impacted a small number of customers. We apologize for any inconvenience they might have caused and hereby explain in more detail what happened and how we will prevent issues like these in the future.
Cause
The outages were caused by problems at our service provider, AWS. We use an Amazon MQ service to process measurement data, create alerts and send them in real time. This service is also needed to save the measurement data to databases. In both incidents, AWS shut our Amazon MQ service down without prior notice and for no apparent reason. We are still communicating with their support teams to understand why these shutdowns occurred. Should we receive more information from them, we will share it with you.
Impact
During both incidents, measurement data could not be sent from the sentinels to Ymonitor. As a result, no alerts could be created and real-time measurement data could not be displayed on dashboards or Yviewer. After AWS put the service back online, the sentinels sent the unsent measurement data and resumed sending real-time data normally. It took another hour for the system to ingest all recovered data.
Corrective and preventive actions
After the first incident on October 23rd, we implemented some hotfixes to prevent measurement data loss. Unfortunately, preventing the loss of real-time alerts requires more work. We are working on a permanent solution to recover from an unexpected shutdown of the Amazon MQ service within a reasonable time, with the aim of minimizing the risk of losing alerts in a similar incident. When the implementation is finished, we will most likely announce an emergency maintenance and deploy it as quickly as possible to minimize service outage.
Should you have more questions, please do not hesitate to reach out to your consultants.
Kind regards,
Sentia part of Accenture