diff --git a/ruby3-upgrades.org b/ruby3-upgrades.org
new file mode 100644
index 0000000..64ec63c
--- /dev/null
+++ b/ruby3-upgrades.org
@@ -0,0 +1,76 @@
#+TITLE: Ruby 3 Upgrades
#+AUTHOR: Adam Mohammed
#+DATE: May 10, 2023


* Agenda
- Recap: API deployment architecture
- Lessons from the Rails 6.0/6.1 upgrades
- Defining key performance indicators

* Recap: API Deployment

The API deployment consists of:
- *frontend pods* - 10 pods dedicated to serving HTTP traffic
- *worker pods* - 8 pods dedicated to background job processing
- *cron jobs* - various rake tasks run periodically to perform upkeep for the API

** Release Candidate Deployment Strategy

This is a form of canary deployment. The strategy diverts a small
amount of traffic to the new version while we watch for an increased
error rate. After some time, we assess how the candidate has been
performing. If things look bad, we scale back and address the
issues. Otherwise, we ramp up the amount of traffic that the candidate
pods see.

Doing things this way lets us build confidence in the release, but it
does not come without drawbacks. The most important caveat is that we
rely on the k8s Service to load-balance between the two versions of
the application. That means we do no tricks to make sure a customer
only ever hits a single app version.

We accept this risk because issues with HTTP requests are mostly
confined to the request itself, and each span is stamped with the
Rails version that processed that portion of the request.

Some HTTP requests are not fully completed at request/response
time. For these endpoints, we queue background jobs that the workers
eventually process. This means some requests will be handled by the
release candidate while the resulting background job is processed by
the older application version.
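One way to make the cross-version job handoff visible is to stamp each job payload with the version of the app that enqueued it, so workers can tell when they are processing a job from a different version. A minimal sketch, not our actual code (the =VersionStamp= module, key names, and version string are all hypothetical):

```ruby
# Sketch: stamp each enqueued job payload with the enqueuing app's
# version so workers can detect cross-version processing.
module VersionStamp
  # In a real Rails app this would come from the app's release metadata.
  APP_VERSION = "6.1.7"

  # Merge the enqueuing version into the job's payload.
  def self.stamp(payload)
    payload.merge("enqueued_by_version" => APP_VERSION)
  end

  # Workers call this to detect jobs enqueued by a different version.
  def self.cross_version?(payload, worker_version: APP_VERSION)
    payload["enqueued_by_version"] != worker_version
  end
end

job = VersionStamp.stamp({ "type" => "sync_projects" })
puts job["enqueued_by_version"]                                # "6.1.7"
puts VersionStamp.cross_version?(job, worker_version: "7.0.4") # true
```

A worker could log or tag a span whenever =cross_version?= is true, which turns the side-by-side compatibility assumption into something observable rather than implicit.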
Because of this, when using this release strategy, we assume that the
two versions are compatible and can run side by side.


* Lessons from Previous Rails Upgrades

We have telemetry set up to monitor the system as a whole, so deciding
whether something looks like an upgrade-related issue or is unrelated
has been left to SME intuition.

In the Rails 5.2 -> 6.0 upgrade we hit a couple of issues:
- Jobs enqueued by Rails 6 could not be served by Rails 5 workers.
  - We addressed this before rolling forward.
- A prometheus-client upgrade meant that all the cron jobs succeeded
  but failed to report their status.

In the Rails 6.1 upgrade we observed a new issue: users saw 404s
through the portal after hitting the =/organizations= endpoint.
- I decided that the scope of the bug was small enough that we were
  okay to roll forward.
- Error rates looked largely the same because the symptom we observed
  was an increased number of 403s on the Projects controller.


* Defining Key Performance Indicators

Typically, what I would do (and what I assume Lucas does) is keep an
eye on Rollbar. Rollbar captures anything fundamentally broken enough
to raise exceptions or errors in Rails. Additionally, I would keep a
broad view of errors by span kind in Honeycomb to see whether we were
seeing a spike associated with the release candidate.
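If we want something firmer than eyeballing dashboards, the per-version error comparison can be expressed as a simple check: compute the error rate of spans grouped by app version and flag the candidate when it exceeds the stable version's rate by some margin. A sketch under assumed inputs (the span shape, method names, and the 1-percentage-point margin are illustrative, not our actual tooling):

```ruby
# Sketch: a KPI check comparing the release candidate's error rate
# against the stable version's, given spans grouped by app version.
def error_rate(spans)
  return 0.0 if spans.empty?
  spans.count { |s| s[:error] }.fdiv(spans.size)
end

# Flag the candidate when its error rate exceeds stable's by more
# than an absolute margin (default: 1 percentage point).
def candidate_regressed?(stable_spans, candidate_spans, margin: 0.01)
  error_rate(candidate_spans) > error_rate(stable_spans) + margin
end

stable    = Array.new(98) { { error: false } } + Array.new(2) { { error: true } }
candidate = Array.new(95) { { error: false } } + Array.new(5) { { error: true } }
puts candidate_regressed?(stable, candidate) # true (5% vs 2%)
```

A threshold like this would not have caught the 6.1 =/organizations= bug on its own, since overall error rates looked flat; slicing the comparison by endpoint or status code as well would make it more sensitive to that kind of localized regression.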