Workstream onboarding experiencing high latency

Incident Report for workstream

Postmortem

What Happened?

We performed scheduled maintenance on 12/23/2023, starting at 3am PST, to upgrade our Amazon Aurora RDS database cluster from version 12 to 13 in order to improve performance and scale costs. The upgrade was expected to complete before 6am PST, with the system back up within 15 minutes, the suggested time required for the upgrade. However, customers experienced unusable latency because we underestimated both the need for and the time taken by the re-analyze operation that lets the database optimize its query plans after the upgrade.

Why Did it Happen?

AWS does not carry over table statistics across a major version upgrade, which led to our queries not running with the most optimized query plans until the statistics were regenerated.
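
As an illustration of the missing step, the following is a minimal sketch of how statistics could be regenerated right after a major upgrade. It assumes the cluster is PostgreSQL-compatible (this report does not name the engine, and the `ANALYZE` syntax shown is PostgreSQL's). The helper names, the input tuples, and the choice to process the largest tables first are hypothetical and are not a description of what was actually run during the incident.

```python
from typing import Iterable, List, Tuple


def quote_ident(name: str) -> str:
    """Quote a PostgreSQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_analyze_statements(tables: Iterable[Tuple[str, str, int]]) -> List[str]:
    """Return one ANALYZE statement per table, largest row estimate first.

    Each tuple is (schema, table, estimated_rows). Ordering by size is an
    assumption: it aims to restore good plans on the biggest tables soonest.
    """
    ordered = sorted(tables, key=lambda t: t[2], reverse=True)
    return [f"ANALYZE {quote_ident(s)}.{quote_ident(t)};" for s, t, _ in ordered]


def run_analyze(conn, statements: List[str]) -> None:
    """Execute the statements on a DB-API connection (for example psycopg2).

    Committing after each statement makes fresh statistics visible to other
    sessions without waiting for the whole list to finish.
    """
    cur = conn.cursor()
    try:
        for stmt in statements:
            cur.execute(stmt)
            conn.commit()
    finally:
        cur.close()
```

For example, `build_analyze_statements([("public", "users", 10), ("public", "events", 1000)])` yields the statement for `events` before the one for `users`. Timing such a run against a prod-sized copy beforehand would give a realistic estimate of how long the post-upgrade window needs to be.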

Action Items

  1. Add a customer acknowledgement step to our change management process to agree on any out-of-window requests for scheduled maintenance.
  2. Create a gameday migration test using a DR environment to test upgrades with prod-sized datasets.
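
One way the gameday test in item 2 could decide pass or fail is to compare query latencies recorded before and after the upgrade in the DR environment. The sketch below is only an assumption about how such a gate might look: the function names, the 1.5x threshold, and the input format (query name mapped to latency samples in milliseconds) are all hypothetical.

```python
from typing import Dict, List, Sequence


def p95(samples: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    ordered = sorted(samples)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def find_regressions(
    before: Dict[str, Sequence[float]],
    after: Dict[str, Sequence[float]],
    max_ratio: float = 1.5,
) -> List[str]:
    """Return names of queries whose post-upgrade p95 exceeds max_ratio times the baseline p95."""
    return sorted(
        name
        for name, base in before.items()
        if name in after and p95(after[name]) > max_ratio * p95(base)
    )
```

A gameday run would fail if `find_regressions` returns any query names, which is the kind of signal that would have flagged stale statistics before the production upgrade.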
Posted Jan 10, 2024 - 11:50 PST

Resolved

At 6:00am PST, the Workstream Onboarding app experienced high latency due to a database upgrade that began at 3am PST and was expected to last until 5am PST at most. The upgrade itself took only 15 minutes, but it introduced unforeseen latency that caused our APIs to back up, because the database required a re-analyze operation after the upgrade before it could optimize queries again. This lasted until 6:48am, when all operations returned to normal.

We apologize for this inconvenience and will follow up with post-mortem actions to provide earlier notifications and regular updates on any delays in scheduled maintenance.
Posted Dec 23, 2023 - 06:57 PST