Ingestion issues between 07:51 and 16:59 GMT
Incident Report for Scalyr
Incident Summary:
A server responsible for log ingestion failed, causing up to 4.5% of logs ingested between 07:51 and 16:59 GMT to be lost.

Incident Details:
At 07:51 GMT, one of the servers responsible for log ingestion encountered a database issue which caused it to stop accepting log upload requests. During this time, some requests to our ingestion servers returned an error, "serverTooBusy". Unfortunately, this particular failure mode did not trigger any paging alerts. The incident was resolved by removing this server from production.

Followup Actions:
1. Identify changes to prevent this specific database issue from recurring.
2. Create additional alerting rules to ensure prompt attention if an ingestion problem of this nature occurs in the future. We will alert both on direct measures of this specific issue (which was related to free disk space) and on overall measures of system function (ingestion rates and failure rates for each ingestion server).
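
As an illustration only, the kind of per-server check described in item 2 might look like the sketch below. This is not Scalyr's actual alerting configuration: the `ServerMetrics` fields, the threshold values, and the server names are all hypothetical, chosen to show a direct measure (free disk space) next to overall measures (ingestion rate and failure rate) evaluated for each ingestion server.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ServerMetrics:
    """Hypothetical per-server snapshot for one evaluation window."""
    name: str
    free_disk_fraction: float  # 0.0 to 1.0
    requests: int              # upload requests received in the window
    failed_requests: int       # requests answered with an error


# Assumed thresholds, for illustration only.
MIN_FREE_DISK_FRACTION = 0.10
MAX_FAILURE_RATE = 0.01
MIN_REQUESTS = 1


def alerts_for(m: ServerMetrics) -> List[str]:
    """Return the paging alerts that this server's metrics would trigger."""
    alerts = []
    # Direct measure: low free disk space.
    if m.free_disk_fraction < MIN_FREE_DISK_FRACTION:
        alerts.append(f"{m.name}: low free disk space")
    # Overall measures: ingestion rate and failure rate.
    if m.requests < MIN_REQUESTS:
        alerts.append(f"{m.name}: no ingestion traffic")
    elif m.failed_requests / m.requests > MAX_FAILURE_RATE:
        alerts.append(f"{m.name}: high failure rate")
    return alerts


if __name__ == "__main__":
    fleet = [
        ServerMetrics("ingest-1", 0.50, 1000, 2),
        ServerMetrics("ingest-2", 0.03, 1000, 400),
    ]
    for server in fleet:
        for alert in alerts_for(server):
            print(alert)
```

Evaluating each server separately matters here: a single failing server can be masked when failure rates are averaged across the whole fleet.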
Posted Aug 29, 2018 - 21:43 UTC
This incident affected: Main Site.