On Friday, 9/28/18, the Wistia application experienced significant delays and outages in certain services over a period of roughly four and a half hours, between approximately 1pm and 5:30pm EDT. The Wistia API and Soapbox were also severely degraded for a smaller portion of this time, roughly two hours. The cause was traced to a customer account uploading an excessive number of video files via the API over a multi-day period, starting approximately two days before the outage. As these uploads continued, the number of associated video processing tasks in the Wistia application grew rapidly, delaying other application activity such as in-app uploads and access to stats. Soapbox editing was also affected.
Bulk importing of video files via the Upload API is a common and expected operation. However, the scale and duration of this particular import reached an unprecedented level, partially due to an error in the client-side script, which repeated the uploads indefinitely until the user manually stopped it.
The backlog of tasks began to gradually degrade application performance starting late Thursday, 9/27, and customer emails to the support team about problems such as delayed stats or long processing queues helped to flag the issue. By the morning of Friday, 9/28, the engineering team had identified the massive backlog and worked to mitigate the issue by increasing the app's capacity for processing it. Unfortunately, the increased workload had the side effect of making the service degradation more apparent, resulting in significant application delays and temporary outages, as noted on our status page.
By 5:30pm EDT, roughly four and a half hours after the outage began, the backlog of tasks had been cleared, and shortly thereafter app performance returned to normal.
In response to this event, our team is pursuing several approaches for earlier detection of similar problems. The primary plan is to introduce multiple layers of internal alerts to catch high-volume uploads, with the option of manually revoking API permissions should uploads exceed a reasonable threshold. Manual revocation is already viable in certain situations, but it relies on identifying high upload volume early. We'll also be streamlining the way background task processing is handled to reduce the impact of a similar situation in the future.
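As a rough illustration of what layered alerts on upload volume could look like, here is a minimal sketch in Python. It is not Wistia's actual implementation: the class name, the one-hour window, the threshold values, and the action labels are all hypothetical, chosen only to show the idea of escalating tiers evaluated against a per-account sliding window.

```python
from collections import defaultdict, deque

# Hypothetical values for illustration only; real thresholds would be tuned
# against normal bulk-import traffic.
WINDOW_SECONDS = 3600
TIERS = [
    (1_000, "notify"),                # first layer: informational alert
    (5_000, "page_on_call"),          # second layer: wake someone up
    (10_000, "review_for_revocation"),  # third layer: consider revoking API access
]


class UploadMonitor:
    """Counts uploads per account in a sliding time window and maps the
    count to the highest alert tier it has reached."""

    def __init__(self, window_seconds=WINDOW_SECONDS, tiers=TIERS):
        self.window = window_seconds
        # Sort descending so the first matching tier is the most severe one.
        self.tiers = sorted(tiers, reverse=True)
        self.events = defaultdict(deque)

    def record(self, account_id, timestamp):
        """Record one upload and return the alert action, or None."""
        uploads = self.events[account_id]
        uploads.append(timestamp)
        # Drop uploads that have aged out of the window.
        while uploads and uploads[0] <= timestamp - self.window:
            uploads.popleft()
        return self.level(len(uploads))

    def level(self, count):
        """Return the most severe action whose threshold is met, or None."""
        for threshold, action in self.tiers:
            if count >= threshold:
                return action
        return None
```

In this sketch the final tier only flags an account for review, which mirrors the manual revocation step described above rather than revoking access automatically.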