Graph, alert and dashboard irregularities for some US customers

Incident Report for DataSet

Resolved

Most of the time series have been rebuilt. Dashboard performance has been restored for most customers, and alert evaluation success rates are back at pre-incident levels. We are marking the issue resolved and expect a full return to pre-incident levels within the next 24 hours.
Posted Aug 30, 2022 - 14:16 UTC

Monitoring

The summary service has been restarted and we are beginning to rebuild time series data for all alerts. Once the time series for an alert has been recreated, we will resume evaluating that alert. Alerts with longer lookback periods will be the last to be evaluated successfully. Loading a dashboard will trigger the time series on that dashboard to be recreated, so that dashboard will load more slowly at first. We are continuing to monitor this process.
Posted Aug 29, 2022 - 23:24 UTC

Update

At 15:00 PDT (22:00 UTC) we will restart our summary service, which powers our alerts and speeds up dashboard rendering. The summary service will be unavailable for approximately 10 minutes, after which it will begin rebuilding time series data for all alerts. An alert will not be evaluated until its time series has been rebuilt, so alerts with longer lookback periods will be the last to be evaluated successfully. Loading a dashboard will trigger the time series on that dashboard to be recreated, so that dashboard will load more slowly at first.
Posted Aug 29, 2022 - 21:38 UTC

Identified

We have identified the issue in our time series database and are working on remediation.
Posted Aug 29, 2022 - 17:40 UTC

Investigating

Some customers in the US cluster are reporting unexpected behavior including false alarms, inconsistent graphs, and missing search results. We are currently investigating the issue.
Posted Aug 29, 2022 - 13:47 UTC
This incident affected: Main Site.