On Tuesday, 12/4/18, the Wistia application experienced a major outage from 4:04 to 4:20PM EST. During this period all Wistia services went down: the Wistia app, uploads, stats, Soapbox, and (uncached) embeds. The majority of embeds accessed within 24 hours of this timeframe would have continued to play normally, as they were cached in our content delivery networks, which were unaffected. Approximately 7-8% of total embed loads during this time were affected.
The outage was an unanticipated byproduct of a critical security update to core back-end systems. This update overlapped with earlier maintenance on those same systems, during which the load balancing systems that serve traffic to the application were configured incorrectly. When the most recent update was complete, the load balancers did not properly connect to the new set of workers, and the outage occurred as a result. The 12/4 update did affect core systems, but it was not expected to have any customer-facing impact on its own; the consequences of the earlier maintenance introduced an unanticipated variable.
When the application went down, Wistia employees recognized the problem immediately and reported it to engineering. Our infrastructure team identified the load balancer issue by 4:12PM and restored the connection shortly thereafter. By 4:20PM all services were restored to normal, for a total outage time of 16 minutes.
Our response to this problem is a mixture of internal process updates for system maintenance and changes to how our infrastructure automation responds to similar situations. We changed the configuration mechanism so that similar issues default to fail-safe operating modes, keeping our systems running without interruption. Proactive safeguards have also been implemented to prevent a repeat of these configuration issues. These changes are already live in the application.
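
As a rough illustration of what defaulting to a fail-safe mode can look like for a load balancer cutover, the sketch below switches traffic to a new set of workers only if enough of them pass a health check, and otherwise keeps serving from the current set. It is a minimal sketch under stated assumptions, not a description of the actual change: BackendPool, choose_active_pool, the is_healthy callback, and the example addresses are all hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class BackendPool:
    """Hypothetical stand-in for a named set of worker addresses."""
    name: str
    workers: Tuple[str, ...]


def healthy_workers(pool: BackendPool, is_healthy: Callable[[str], bool]) -> List[str]:
    """Return the workers in the pool that pass the supplied health check."""
    return [worker for worker in pool.workers if is_healthy(worker)]


def choose_active_pool(
    current: BackendPool,
    candidate: BackendPool,
    is_healthy: Callable[[str], bool],
    min_healthy: int = 1,
) -> BackendPool:
    """Fail-safe cutover: adopt the candidate pool only if it has enough
    healthy workers; otherwise keep routing traffic to the current pool."""
    if len(healthy_workers(candidate, is_healthy)) >= min_healthy:
        return candidate
    return current


# Example: the new workers are not reachable yet, so traffic stays on the old pool.
old_pool = BackendPool("workers-old", ("10.0.0.1", "10.0.0.2"))
new_pool = BackendPool("workers-new", ("10.0.1.1", "10.0.1.2"))
reachable = {"10.0.0.1", "10.0.0.2"}  # assumed health-check result

active = choose_active_pool(old_pool, new_pool, lambda worker: worker in reachable)
print(active.name)  # workers-old
```

The design choice in this sketch is that a failed or incomplete cutover leaves the previous, known-good routing in place instead of pointing the load balancer at workers it cannot reach.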