At 1:35:11 AM PDT, the service responsible for delivering webhook notifications encountered an error and crashed. Our investigation found the cause: a webhook URL failed DNS resolution, and that specific error was handled incorrectly. Because the URL had not been updated, the service encountered it again on every automatic restart, hit the same error, and kept failing. Webhooks are a newer feature, and our alerting did not catch this specific case (the service continuously restarting). An engineer investigated and resolved the issue at 7:43 AM. None of the existing events were lost, because the queue we pull events from is persisted. Below are the action items we are taking to ensure something like this does not occur again: