#+TITLE: Ruby 3 Upgrades
#+AUTHOR: Adam Mohammed
#+DATE: May 10, 2023

* Agenda

- Recap: API deployment architecture
- Lessons from the Rails 6.0/6.1 upgrade
- Defining key performance indicators

* Recap: API Deployment

The API deployment consists of:

- *frontend pods* - 10 pods dedicated to serving HTTP traffic
- *worker pods* - 8 pods dedicated to job processing
- *cron jobs* - various rake tasks executed to perform the periodic upkeep necessary for the API context

** Release Candidate Deployment Strategy

This is a form of canary deployment. The strategy involves diverting a
small amount of traffic to the new version while watching for an
increased error rate. After some time, we assess how the candidate has
been performing. If things look bad, we scale back and address the
issues. Otherwise, we ramp up the amount of traffic that the candidate
pods see.

Doing things this way allows us to build confidence in the release, but
it does not come without drawbacks. The most important thing to be
aware of is that we're relying on the k8s Service to load balance
between the two versions of the application, so the share of traffic
the candidate receives is roughly proportional to its share of the
pods. That also means we're not doing anything to ensure that a
customer only ever hits a single app version.

We accept this risk because issues with HTTP requests are mostly
confined to the request itself, and each span is stamped with the Rails
version that processed that portion of the request.

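As a rough illustration of that stamping, here is a minimal sketch
assuming the honeycomb-beeline gem; the field name and the callback
placement are illustrative rather than a copy of our actual
instrumentation.

#+begin_src ruby
# app/controllers/application_controller.rb (sketch)
class ApplicationController < ActionController::Base
  before_action :stamp_rails_version

  private

  # Attach the running Rails version to the current trace so every span
  # emitted while handling this request can be filtered by app version.
  def stamp_rails_version
    Honeycomb.add_field_to_trace("app.rails_version", Rails::VERSION::STRING)
  end
end
#+end_src
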
Some HTTP requests are not completed entirely at request/response
time. For these endpoints, we queue up background jobs that the workers
eventually process. This means that some requests will be processed by
the release candidate while the resulting background job is processed
by the older application version.

Because of this, when using this release strategy, we're assuming that
the two versions are compatible, and can run side-by-side.

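To make the cross-version hand-off concrete, here is a minimal sketch;
the job class, model, and arguments are hypothetical rather than taken
from the actual API code.

#+begin_src ruby
# A hypothetical job illustrating the hand-off described above.
class ExportReportJob < ApplicationJob
  queue_as :default

  # This may execute on a worker pod running the *older* release even when
  # the enqueuing HTTP request was handled by the release candidate, so the
  # argument shape has to deserialize cleanly on both versions.
  def perform(report_id)
    Report.find(report_id).generate!
  end
end

# In a controller action on whichever version served the request:
#   ExportReportJob.perform_later(report.id)
#+end_src
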
* Lessons from Previous Rails Upgrades

We have telemetry set up to monitor the system as a whole, so deciding
whether something looks like an upgrade-related issue or is unrelated
has been left to SMEs' intuition.

In the Rails 5.2 -> 6.0 upgrade we hit a couple of issues:

- Jobs enqueued from Rails 6 could not be processed by the Rails 5
  workers.
  - We addressed this before rolling forward.
- A prometheus-client upgrade meant that all the cron jobs succeeded
  but failed to report their status (see the sketch below).

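For context, here is a minimal sketch of the kind of status reporting
the cron tasks do, assuming the prometheus-client gem and a
Pushgateway; the metric, job, and gateway names are illustrative, and
the constructor signatures vary between gem versions, which is exactly
the sort of change that broke the reporting.

#+begin_src ruby
require "prometheus/client"
require "prometheus/client/push"

# Record a "last successful run" timestamp for a periodic rake task and push
# it to the Pushgateway. The task itself kept succeeding during the upgrade;
# it was this reporting step that silently stopped working.
registry = Prometheus::Client.registry

last_success = Prometheus::Client::Gauge.new(
  :cron_last_success_timestamp,
  docstring: "Unix time of the last successful run",
  labels: [:task]
)
registry.register(last_success)

last_success.set(Time.now.to_i, labels: { task: "periodic_upkeep" })

Prometheus::Client::Push
  .new(job: "api-cron", gateway: "http://pushgateway:9091")
  .add(registry)
#+end_src
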
In the Rails 6.1 upgrade we observed a new issue: users were seeing
404s through the portal after hitting the =/organizations= endpoint.

- I decided that the scope of the bug was small enough that we were
  okay to roll forward.
- Error rates looked largely the same because the symptom we observed
  was an increased number of 403s on the Projects Controller.

* Defining Key Performance Indicators

Typically, what I would do (and what I assume Lucas does) is keep an
eye on Rollbar. Rollbar captures anything that is fundamentally broken
enough to raise exceptions or errors in Rails. Additionally, I would
keep a broad view of errors by span kind in Honeycomb to see whether we
were seeing a spike associated with the release candidate.

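One thing that makes attribution easier is tagging every Rollbar report
with the running app version. A minimal sketch, assuming the rollbar
gem; the =RELEASE_CANDIDATE= flag is hypothetical:

#+begin_src ruby
# config/initializers/rollbar.rb (sketch)
Rollbar.configure do |config|
  config.access_token = ENV["ROLLBAR_ACCESS_TOKEN"]
  config.environment  = ENV.fetch("RAILS_ENV", "production")

  # Stamp each report with the running Rails version so an error spike can
  # be attributed to the release candidate or the stable deployment.
  config.code_version = Rails::VERSION::STRING

  # Hypothetical flag set only on the release-candidate pods.
  config.custom_data_method = lambda do
    { release_candidate: ENV["RELEASE_CANDIDATE"] == "true" }
  end
end
#+end_src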