On Friday, 1/18/19, the Wistia application experienced an uploading outage from 9:53 to 10:12am ET. During this time, all recently uploaded media in the final stage of media creation failed to complete. Customers uploading from their browser were met with an error message, and API users received 401 or 500 HTTP status codes on affected uploads. Throughout this period, uploading was unavailable through both the Wistia app and the upload API. Any uploads that failed during the outage had to be re-uploaded once the problem was resolved.
The ~19-minute outage was the result of an internal update from earlier in the week whose deployment had been delayed. When this update was pushed to production on the 18th, it contained an outdated configuration that broke the chain of communication between the Wistia app and our uploading servers. The deploy happened at approximately 9:50am ET, and uploads began to fail just a few minutes later.
The engineering team caught the problem internally at approximately 9:57am ET. The team responded immediately by verifying the issue, and within minutes identified the recent deploy as the likely root cause. By 10:08am, incident responders had rolled back to an earlier commit, which restored the correct configuration, and by 10:12am the issue was confirmed to be fixed.
Our follow-up work aims to address both the root cause and earlier detection of this type of issue. Similar configuration updates will be broken into smaller pieces so that their contents and possible impact are easier to identify at deploy time. Additionally, we will add multiple layers of internal alerts, so that if a similar service outage occurs, the window between identification and resolution becomes even smaller. QA is also considering more real-time tests for verifying service status immediately after a new config is deployed.
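
As an illustration of what such a post-deploy check could look like, here is a minimal sketch in Python. It is not our actual tooling: the names post_deploy_check, is_healthy, and CheckResult are hypothetical, and the probe is assumed to be any function that attempts a small test upload and returns the HTTP status code it received. Treating every non-2xx response as a failure is also an assumption; the 401 and 500 codes seen during this outage would both be caught by it.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


def is_healthy(status_code: int) -> bool:
    """Assumption: only 2xx responses count as a successful upload probe."""
    return 200 <= status_code < 300


@dataclass
class CheckResult:
    healthy: bool        # True only if every probe attempt succeeded
    statuses: List[int]  # Status code returned by each attempt, in order


def post_deploy_check(
    probe: Callable[[], int],
    attempts: int = 3,
    delay_seconds: float = 0.0,
) -> CheckResult:
    """Run the upload probe several times right after a deploy.

    `probe` is a hypothetical callable that performs one test upload
    and returns the HTTP status code of the response.
    """
    statuses: List[int] = []
    for i in range(attempts):
        statuses.append(probe())
        # Pause between attempts, but not after the last one.
        if i < attempts - 1 and delay_seconds > 0:
            time.sleep(delay_seconds)
    return CheckResult(
        healthy=all(is_healthy(s) for s in statuses),
        statuses=statuses,
    )
```

A deploy pipeline could call post_deploy_check with a real upload probe immediately after a config change and trigger an alert or a rollback when the result is not healthy, rather than waiting for failures to be noticed by hand.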