Streamable and non-streamable exports are experiencing increased latencies
Incident Report for Labelbox
Postmortem

Dear Customers,

We want to bring to your attention a recent latency issue with export functionality on the Labelbox platform. Below, we explain the cause and the remediation, and describe the actions we’re taking to prevent similar incidents in the future.

Summary: On April 15, 2024, at around 14:00 UTC, exports began experiencing increased latency. We investigated, determined the root cause, and had a fix implemented by 16:10 UTC; the incident was marked resolved at 19:58 UTC. Some users also experienced sporadic login issues while the mitigation was in progress.

Root Cause: A sudden surge in exports caused high resource usage on the primary database and a resulting backlog in internal systems, even as those systems scaled to meet the increased load.

Resolution: Once the root cause was identified, we implemented a fix that directs export traffic to a replica, reducing load on the primary database. We also updated the batch processing logic for exports to reduce the probability of overload. We then monitored the systems to confirm that the problem was mitigated.
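
For readers interested in the general pattern, the two mitigations above can be sketched as follows. This is a minimal illustration and not Labelbox's actual implementation: the names `DatabaseTargets`, `choose_target`, `in_batches`, and the host strings are all hypothetical, and the sketch only assumes that export reads can tolerate being served from a replica.

```python
from dataclasses import dataclass
from typing import Iterator, Sequence


@dataclass(frozen=True)
class DatabaseTargets:
    """Hypothetical connection targets; real systems would hold DSNs or pools."""
    primary: str
    replica: str


# Assumption: only export workloads are treated as replica-safe reads.
REPLICA_WORKLOADS = frozenset({"export"})


def choose_target(workload: str, targets: DatabaseTargets) -> str:
    """Send read-heavy export work to the replica, everything else to the primary."""
    return targets.replica if workload in REPLICA_WORKLOADS else targets.primary


def in_batches(row_ids: Sequence[int], batch_size: int) -> Iterator[Sequence[int]]:
    """Yield fixed-size slices so a single export never issues one huge query."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(row_ids), batch_size):
        yield row_ids[start:start + batch_size]


targets = DatabaseTargets(primary="primary-host", replica="replica-host")
export_target = choose_target("export", targets)
login_target = choose_target("login", targets)
batches = list(in_batches([1, 2, 3, 4, 5], batch_size=2))
```

Routing by workload keeps latency-sensitive traffic such as logins on the primary, while smaller batches bound how much work any one export places on the database at a time.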

We have also identified and scheduled the following additional corrective actions to prevent similar issues in the future:

  • Enhancing the concurrent message processing logic.
  • Improving the retry handling mechanism to prevent a backlog of unprocessed messages.
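
As an illustration of what these two corrective actions typically involve, here is a minimal sketch of bounded concurrent processing with capped retries. It is not Labelbox's implementation: `backoff_delay`, `process_with_retries`, `drain`, and all default values are hypothetical, and the sketch assumes messages that keep failing are set aside rather than retried forever.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Exponential backoff in seconds, capped so waits never grow unbounded."""
    return min(cap, base * (2 ** attempt))


def process_with_retries(
    message: str,
    handler: Callable[[str], None],
    max_attempts: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try a message a bounded number of times; report whether it succeeded."""
    for attempt in range(max_attempts):
        try:
            handler(message)
            return True
        except Exception:
            # Wait before the next attempt, but not after the final one.
            if attempt + 1 < max_attempts:
                sleep(backoff_delay(attempt))
    return False


def drain(
    messages: Sequence[str],
    handler: Callable[[str], None],
    max_workers: int = 4,
    max_attempts: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Process messages with bounded concurrency; return those that never succeeded."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        outcomes = list(
            pool.map(
                lambda m: process_with_retries(m, handler, max_attempts, sleep),
                messages,
            )
        )
    # Failed messages are returned for separate handling instead of re-queued.
    return [m for m, ok in zip(messages, outcomes) if not ok]
```

Capping both the worker count and the number of attempts means a burst of failing messages cannot occupy the queue indefinitely; whatever `drain` returns can be inspected or replayed later without blocking new work.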

We apologize for this inconvenience. We are committed to making Labelbox the most reliable and responsive data-centric AI platform, and we will support your continued use in any way we can.

Sincerely,

The Labelbox team

Posted Apr 16, 2024 - 22:27 UTC

Resolved
Export latency is resolved.
We'll be posting a post-mortem analysis of the incident within a week.
Posted Apr 15, 2024 - 19:58 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Apr 15, 2024 - 16:10 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Apr 15, 2024 - 14:04 UTC
This incident affected: Platform (common services).