Webhooks service outage
Incident Report for Labelbox
Postmortem

The webhooks service dropped its connection to the queue that processes its events, and automatic reconnection to the queue failed. While in this state, the service continued to respond to health checks but was not processing messages. As a result, the queue backed up and webhook notifications were not sent. After the service was restarted and the queue connection was restored, workers began processing the backlog and all notifications were sent out. The following action items are being prioritized to ensure that an outage like this does not occur again and that, should something similar occur, monitoring and alerting are in place for the whole lifecycle of the functionality:

  • Configure robust monitoring and alerting for both the service and the queue
  • Configure service-level health checks so that they reflect whether messages are actually being processed
  • Add logging and metrics for the service and the queue
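
As an illustration of the health-check item above, the following is a minimal sketch of a check that fails when the queue connection is down or the consumer has stopped making progress, rather than only confirming that the process is alive. It is not the actual implementation of the webhooks service: the class `ConsumerHealth`, the function `health_status_code`, and the 60-second idle threshold are all hypothetical names and values chosen for the example.

```python
import time
from typing import Callable


class ConsumerHealth:
    """Hypothetical tracker for a queue consumer's connection and progress."""

    def __init__(
        self,
        max_idle_seconds: float = 60.0,  # assumed threshold, tune per workload
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max_idle = max_idle_seconds
        self._clock = clock
        self._connected = False
        self._last_progress = clock()

    def mark_connected(self) -> None:
        # Call when the queue connection is (re)established.
        self._connected = True
        self._last_progress = self._clock()

    def mark_disconnected(self) -> None:
        # Call when the connection drops or a reconnection attempt fails.
        self._connected = False

    def mark_progress(self) -> None:
        # Call after each processed message or each successful empty poll.
        self._last_progress = self._clock()

    def is_healthy(self) -> bool:
        # Unhealthy if disconnected, or connected but idle for too long.
        if not self._connected:
            return False
        return self._clock() - self._last_progress <= self._max_idle


def health_status_code(health: ConsumerHealth) -> int:
    """Map consumer health to an HTTP status for a health endpoint."""
    return 200 if health.is_healthy() else 503
```

With a check of this shape, a consumer that has lost its queue connection would report 503, so an orchestrator or alerting system could restart it or page someone instead of seeing a healthy service while the queue backs up.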
Posted Dec 09, 2019 - 22:23 UTC

Resolved
The webhooks service outage caused a backlog of events, which has since been processed.
Posted Dec 09, 2019 - 09:30 UTC