Dashboard Service Downtime Post-mortem on September 24th, 2018

September 27, 2018

By The Cypress Team

On Monday, September 24th, the Cypress Dashboard service experienced approximately 4 hours of downtime, from 18:20 UTC to 22:40 UTC. This post documents the events that occurred, our investigation and solution, and the immediate actions we're taking to mitigate and improve our response to service downtime scenarios.

Events

  • At 18:20 UTC we started seeing elevated levels of a variety of errors. One repeated error indicated that our application hosting provider was experiencing "internal platform errors", which we initially assumed to be the root cause of the other errors. However, this error only served to distract from the real issue.

  • Our monitoring service provider (New Relic) was also down during this time, reducing our visibility into the incident. That outage was unrelated to ours, but it contributed to a perfect storm of problems.

  • We detected that background jobs in our queue system were not being completed fast enough, and that we had surpassed the maximum memory limit of our Redis-based queue system. The overloaded Redis instance was unable to store more jobs, which blocked our API servers (a sketch of this kind of health check follows this list).

  • We increased our Redis memory limit and added more worker servers to decongest the queue system.

  • At 22:40 UTC error levels dropped and our API was responsive again.
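
To give a sense of what this looked like, here is a minimal sketch of the kind of health check that surfaces this condition, assuming a Redis-backed queue reachable through the ioredis client; the connection URL, queue key, and 90% threshold are placeholders rather than our actual configuration:

```ts
import Redis from "ioredis";

// Placeholder connection URL and queue key; substitute real values.
const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");
const QUEUE_KEY = "queue:jobs"; // hypothetical list holding pending jobs

async function checkQueueHealth(): Promise<void> {
  // INFO memory reports used_memory and maxmemory (0 means no limit set).
  const info = await redis.info("memory");
  const read = (field: string): number =>
    Number(info.match(new RegExp(`^${field}:(\\d+)`, "m"))?.[1] ?? 0);

  const used = read("used_memory");
  const max = read("maxmemory");
  const backlog = await redis.llen(QUEUE_KEY); // jobs waiting for workers

  console.log(`redis memory: ${used} / ${max || "unlimited"} bytes`);
  console.log(`queue backlog: ${backlog} jobs`);

  if (max > 0 && used / max > 0.9) {
    console.warn("Redis is above 90% of maxmemory; new jobs may be rejected");
  }
}

checkQueueHealth().finally(() => redis.disconnect());
```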

Investigation & Solution

  • The queue system was congested by a backlog of over 100K analytics event tracking jobs bound for our analytics provider (Segment). We had also increased the number of events we tracked in our API.

  • Queue workers were processing event jobs at a very slow rate, which was causing the backlog of jobs and significantly increasing our Redis memory usage. Our memory usage is typically very low since we delete job data from Redis upon successful job completion.

  • The issue was multi-faceted:

    1. We had not configured the queue system to properly handle the analytics event workload.
    2. The Segment API client module batched requests in memory and would not complete event tracking jobs until 10 seconds had passed or 20 requests were batched.
  • Once we understood the root cause, the solution was simple: we increased the concurrency of our queue worker processes, which let us process the backlog of ~100K jobs in around a minute and reduce our Redis memory usage by ~95% (a sketch illustrating the batching behavior and the concurrency change follows below).
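
To illustrate both pieces, here is a minimal sketch that uses Bull as a stand-in for our queue library, together with Segment's analytics-node client; the queue name, write key, and concurrency value are illustrative, not our production settings:

```ts
import Queue from "bull";
import Analytics from "analytics-node";

const REDIS_URL = process.env.REDIS_URL ?? "redis://localhost:6379";
const WORKER_CONCURRENCY = 25; // illustrative value, not our actual setting

// The Segment client batches events in memory and flushes once `flushAt`
// events are buffered or `flushInterval` milliseconds have passed, so a
// worker handling one job at a time can wait up to 10 seconds per job for
// the flush callback.
const analytics = new Analytics(process.env.SEGMENT_WRITE_KEY ?? "WRITE_KEY", {
  flushAt: 20,          // flush after 20 buffered events...
  flushInterval: 10000, // ...or after 10 seconds, whichever comes first
});

const eventQueue = new Queue("analytics-events", REDIS_URL);

// Processing many jobs concurrently fills the client's batch quickly instead
// of paying the flush interval one job at a time.
eventQueue.process(WORKER_CONCURRENCY, async (job) => {
  const { userId, event, properties } = job.data;
  await new Promise<void>((resolve, reject) => {
    analytics.track({ userId, event, properties }, (err) =>
      err ? reject(err) : resolve()
    );
  });
});
```

With a concurrency of one, every job effectively waits out the client's batching window before it can complete; raising concurrency lets many jobs share a single flush, which lines up with the backlog draining in about a minute once we made the change.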

Improvements & Changes

We greatly appreciate our users' patience during this downtime. We obviously never want to experience this scenario, and we're taking the following immediate actions:

  • Launching a status site to better communicate incidents. During this incident we posted updates on our Twitter (@Cypress_io) page, which is not sufficient.

  • Improving our alerting systems to better inform us of issues for early prevention and faster response times.

  • Enhancing our monitoring systems to better report the state of our queue system and API servers (see the sketch after this list).

  • Expanding geographical distribution of our infrastructure to minimize downtime probability.
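
As one small illustration of the kind of queue check involved, here is a minimal sketch of a periodic backlog alert, again assuming ioredis, Node 18+ for the global fetch, and a placeholder webhook URL; the key name and threshold are illustrative:

```ts
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");
const QUEUE_KEY = "queue:jobs";               // hypothetical pending-jobs list
const BACKLOG_THRESHOLD = 10000;              // illustrative threshold
const ALERT_WEBHOOK = process.env.ALERT_WEBHOOK_URL; // e.g. a chat or paging webhook

async function checkBacklog(): Promise<void> {
  const backlog = await redis.llen(QUEUE_KEY);
  if (backlog > BACKLOG_THRESHOLD && ALERT_WEBHOOK) {
    await fetch(ALERT_WEBHOOK, {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify({ text: `Queue backlog is at ${backlog} jobs` }),
    });
  }
}

// Run every minute; in practice this belongs in the monitoring provider itself.
setInterval(() => checkBacklog().catch(console.error), 60000);
```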

The improvements outlined in this post are just the start. We only recently came out of beta, and we are building the foundation for the next generation of Cypress features.

Please feel free to contact us if you have any questions at [email protected].