Service degraded for Labeling
Incident Report for Labelbox
Postmortem

Dear Customers,

We want to bring to your attention a recent service degradation affecting labeling operations on the Labelbox platform. Below we describe the cause, the remediation, and the actions we are taking to prevent such incidents from happening in the future.

Summary: On April 25, 2024, at around 9:25 AM UTC, labeling operations began to degrade. Customers experienced this mostly as general slowness in the product. We launched an investigation, determined the root cause, and deployed a fix by 2:59 PM UTC.

Root Cause:

  • Task queues were overwhelmed around the same time data rows were deleted. Reconciling these deletions for data consistency resulted in a database deadlock.
  • Corrupt data was introduced into our global data processing pipeline, causing it to halt and impacting all services at Labelbox.

Resolution:

Once the root cause was identified, we implemented the following fixes:

  • Data row deletion reconciliation was decoupled from the task consumer. The message batch size was also reduced, which simplifies and shrinks each database query and greatly lowers the probability of a deadlock.
  • The corrupt data was identified and removed. We have implemented a code patch to prevent similar corrupt data from impacting our global data processing pipeline.
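
For readers who want a concrete picture of the two fixes above, the following is a minimal Python sketch, not Labelbox's actual implementation. Every name in it (`chunked`, `reconcile_deletions`, `split_valid`, the `delete_batch` callback, and the default batch size of 100) is hypothetical. Sorting the IDs before batching is an added assumption: it is a common way to keep lock ordering consistent across workers, and the report does not say whether Labelbox does this.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def chunked(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive batches containing at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def reconcile_deletions(
    deleted_ids: Iterable[str],
    delete_batch: Callable[[List[str]], None],
    batch_size: int = 100,  # hypothetical default, not a Labelbox setting
) -> int:
    """Reconcile deletions in small batches, separately from task consumption.

    IDs are de-duplicated and sorted so that every worker touches rows in the
    same order (an assumption of this sketch, not something the report states).
    Returns the number of batches issued.
    """
    batches = 0
    for batch in chunked(sorted(set(deleted_ids)), batch_size):
        delete_batch(batch)  # one small query per batch
        batches += 1
    return batches


def split_valid(
    records: Iterable[Dict[str, str]],
    is_valid: Callable[[Dict[str, str]], bool],
) -> Tuple[List[Dict[str, str]], List[Dict[str, str]]]:
    """Separate valid records from corrupt ones instead of halting on the first bad record."""
    valid: List[Dict[str, str]] = []
    quarantined: List[Dict[str, str]] = []
    for record in records:
        (valid if is_valid(record) else quarantined).append(record)
    return valid, quarantined
```

In this sketch, the reconciliation function would run in its own worker rather than inside the task consumer, and quarantined records would be set aside for inspection while the rest of the pipeline keeps running.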

We apologize for this inconvenience. We are here to make Labelbox the most reliable and responsive data-centric AI platform, and will provide support in any way we can for your continued use.

Sincerely,

The Labelbox team

Posted Apr 26, 2024 - 20:36 UTC

Resolved
This incident has been resolved.
Posted Apr 25, 2024 - 14:58 UTC
Update
The issue is resolved, but the system is still catching up with processing the data. During this time, the Data Rows tab may display inaccurate details.
Posted Apr 25, 2024 - 13:22 UTC
Monitoring
The fix has been implemented, and we're observing improvements in the system's health. Our teams are actively monitoring the situation, and we anticipate full resolution within the next 30 minutes.
Posted Apr 25, 2024 - 11:44 UTC
Update
We are continuing to work on a fix for this issue.
Posted Apr 25, 2024 - 10:19 UTC
Update
We are continuing to work on a fix for this issue.
Posted Apr 25, 2024 - 10:17 UTC
Identified
The issue has been identified and we are working on a resolution.
Posted Apr 25, 2024 - 10:15 UTC
This incident affected: Annotate.