Delayed async jobs
Incident Report for Wistia
Postmortem

On Friday, 9/28/18, the Wistia application experienced significant delays and outages in certain services over a period of roughly four and a half hours, between approximately 1pm and 5:30pm EDT. The Wistia API and Soapbox were also severely degraded for a smaller portion of this time, roughly two hours. The cause was traced to a customer account uploading an excessive number of video files via the API over a multi-day period, starting approximately two days before the outage. As these uploads continued, the volume of associated video processing tasks in the Wistia application grew rapidly, delaying other application activity such as in-app uploads and access to stats. Soapbox editing was also affected.

Bulk importing of video files via the Upload API is a common and expected operation. However, the scale and duration of this particular import reached an unprecedented level, partly due to an error in the client-side script, which repeated the uploads ad infinitum until the user ended it manually.
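
As an illustration only, a bulk import loop can be written so that it is guaranteed to terminate. The report does not describe the customer's script or its actual bug, so the sketch below is an assumption: `bulk_upload`, `upload_one`, and `max_attempts` are hypothetical names, and `upload_one` stands in for whatever call performs a single upload.

```python
def bulk_upload(paths, upload_one, max_attempts=3):
    """Upload each distinct path once, retrying failures a bounded number of times.

    `upload_one` is a hypothetical callable that uploads a single file and
    raises an exception on failure. Returns (uploaded, failed) lists.
    """
    uploaded, failed = [], []
    seen = set()
    for path in paths:
        # Skip duplicates so the same file is never submitted twice.
        if path in seen:
            continue
        seen.add(path)
        for attempt in range(1, max_attempts + 1):
            try:
                upload_one(path)
            except Exception:
                # Give up on this file after the last allowed attempt.
                if attempt == max_attempts:
                    failed.append(path)
            else:
                uploaded.append(path)
                break
    return uploaded, failed
```

Because the outer loop walks a finite list once and the inner loop is capped by `max_attempts`, the total number of upload calls is bounded regardless of how the uploads fail.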

The backlog of tasks began to gradually degrade application performance starting late Thursday, 9/27, and customer emails to the support team about problems such as delayed stats or long processing queues helped to flag the issue. By the morning of Friday, 9/28, the engineering team had identified the massive backlog and worked to mitigate it by increasing the app's capacity for processing background tasks. Unfortunately, the increased workload had the side effect of making the service degradation more apparent, resulting in significant application delays and temporary outages, as noted on our status page.

By 5:30pm EDT, just over four hours after the outage began, the backlog of tasks had been emptied, and shortly thereafter app performance returned to normal.

In response to this event, our team is pursuing several approaches to earlier detection of similar problems. The primary plan is to introduce multiple layers of internal alerts to catch high-volume uploads, with the option of manually revoking an account's API permissions if its uploads exceed a reasonable threshold. Manual revocation is already viable in certain situations; however, it relies on identifying the high upload volume early. We'll also be streamlining the way background task processing is handled to reduce the impact of a similar situation in the future.
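
A layered upload-volume alert of the kind described above could be sketched roughly as follows. This is not Wistia's implementation: the class name, the one-hour window, and both thresholds are hypothetical values chosen only for illustration.

```python
from collections import defaultdict, deque

# Hypothetical settings; the report does not state real thresholds.
WINDOW_SECONDS = 3600
WARN_THRESHOLD = 500
CRITICAL_THRESHOLD = 2000


class UploadVolumeMonitor:
    """Counts uploads per account within a sliding time window."""

    def __init__(self, window_seconds=WINDOW_SECONDS,
                 warn=WARN_THRESHOLD, critical=CRITICAL_THRESHOLD):
        self.window_seconds = window_seconds
        self.warn = warn
        self.critical = critical
        self._events = defaultdict(deque)  # account_id -> upload timestamps

    def record_upload(self, account_id, timestamp):
        """Record one upload and return the alert level for the account."""
        events = self._events[account_id]
        events.append(timestamp)
        # Drop uploads that have fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while events and events[0] <= cutoff:
            events.popleft()
        count = len(events)
        if count >= self.critical:
            return "critical"  # e.g. page on-call, consider revoking API access
        if count >= self.warn:
            return "warn"      # e.g. notify the team for a closer look
        return "ok"
```

The two levels mirror the idea of multiple alert layers: a lower threshold prompts a person to look, and a higher one signals that manual intervention such as revoking API permissions may be warranted.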

Posted Oct 23, 2018 - 14:29 EDT

Resolved
This incident has been resolved.
Posted Sep 28, 2018 - 17:53 EDT
Monitoring
The queue has been completely drained. We'll continue to monitor the app.
Posted Sep 28, 2018 - 17:34 EDT
Update
Due to heavy load induced by working through our backlog, Soapbox users may not be able to save edits to videos, and the Wistia API may be unresponsive during this period.
Posted Sep 28, 2018 - 15:33 EDT
Update
We are continuing to work on a fix for this issue.
Posted Sep 28, 2018 - 15:24 EDT
Identified
We have increased our async job capacity significantly to work through the backlog.
Posted Sep 28, 2018 - 14:45 EDT
Investigating
Jobs that process in the background, such as search indexing, updating embeds after they're saved in Customize, and sending emails from the app, may be delayed longer than usual. We are currently investigating.
Posted Sep 28, 2018 - 13:16 EDT
This incident affected: App, Embeds, Stats, and Soapbox.