Webhooks service outage

Incident Report for Labelbox

Postmortem

At 1:35:11 AM PDT, the service responsible for delivering webhook notifications encountered an error and crashed. Each time the service attempted to restart itself, it ran into the same error and failed again. Our investigation found the cause to be a webhook URL that failed DNS resolution, an error that was handled incorrectly. Because the URL had not been updated, the service encountered it again on every restart and errored out. As webhooks functionality is newer, our alerting failed for this specific case (the service continuously restarting). An engineer investigated and resolved the issue at 7:43 AM. No events were lost, as the queue we pull events from is persisted. Below are the action items we are taking to ensure something like this does not occur again:

Better alerting and monitoring of our webhooks service integrated with our on-call engineer.
Better error handling and edge case detection, including automatically deactivating webhooks whose URL is invalid and alerting the user when this occurs.
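
As a rough illustration of the second action item, the sketch below shows one way a delivery worker could contain a DNS resolution failure to the affected webhook instead of letting it crash the whole service. It is not our production code: the `Webhook` record, `process_event`, `system_resolver`, and the `send` and `notify_owner` callbacks are hypothetical names chosen for this example.

```python
import socket
from dataclasses import dataclass
from typing import Callable
from urllib.parse import urlsplit


@dataclass
class Webhook:
    # Hypothetical record; the field names are illustrative only.
    url: str
    active: bool = True


def system_resolver(host: str) -> None:
    # Raises socket.gaierror when the hostname cannot be resolved.
    socket.getaddrinfo(host, None)


def process_event(
    webhook: Webhook,
    send: Callable[[Webhook], None],
    resolve: Callable[[str], None] = system_resolver,
    notify_owner: Callable[[Webhook, str], None] = lambda wh, reason: None,
) -> bool:
    """Return True if the event was handed to `send`.

    A malformed or unresolvable URL never raises out of this function.
    """
    if not webhook.active:
        return False
    try:
        # urlsplit can raise ValueError on malformed input.
        host = urlsplit(webhook.url).hostname
        if host is None:
            raise ValueError("URL has no hostname")
        resolve(host)
    except (socket.gaierror, ValueError) as exc:
        # Contain the failure to this one webhook: deactivate it and tell
        # the owner, so a restart does not hit the same URL again.
        webhook.active = False
        notify_owner(webhook, f"webhook deactivated: {exc}")
        return False
    send(webhook)
    return True
```

The resolver is passed in as a parameter so the failure path can be exercised without network access. A real implementation would likely retry a few times before deactivating, since DNS failures can be transient.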

Posted Jul 15, 2019 - 21:41 UTC

Resolved

Outage of the webhooks service; notifications were not being delivered.

Posted Jul 15, 2019 - 14:45 UTC