Update notes for 5/10

Ruby upgradessss
2023-05-10 16:06:51 -04:00 · 2023-05-10 16:03:37 -04:00
3 changed files with 116 additions and 8 deletions
--- a/notes.org
+++ b/notes.org
@@ -1,13 +1,6 @@
 * Tasks
-
-** TODO Look at why we'd be getting request bodies for GET ips avail
-  [2023-04-19 Wed]
-** TODO Upgrade CRDB to 22.2.7
 ** TODO Put together POC for micro-caching RAILS
-** TODO Look at "compiling" krakend configs from OpenAPI
-** TODO Meeting with DevRel to talk about Provisioning Failures
-
-
+** DONE Meeting with DevRel to talk about Provisioning Failures
 Chris:
 Cluster api - failed provision
   it shows up with a 403 - moving the project to a new project
@@ -32,3 +25,4 @@ Cluster api - failed provision


   check on rescue and reinstall operations
+** TODO Create a ticket to deal with 403s for provisioning failures
--- a/notes.org_archive
+++ b/notes.org_archive
@@ -215,3 +215,41 @@ cc817f6e-f56f-4cae-91f2-eb1a85049847
 :ARCHIVE_CATEGORY: notes
 :ARCHIVE_TODO: DONE
 :END:
+
+* DONE Audit Spot Market Bids
+:PROPERTIES:
+:ARCHIVE_TIME: 2023-05-10 Wed 16:03
+:ARCHIVE_FILE: ~/notes/org-notes/notes.org
+:ARCHIVE_OLPATH: Tasks
+:ARCHIVE_CATEGORY: notes
+:ARCHIVE_TODO: DONE
+:END:
+
+#+begin_src sql :name max_bids per facility
+SELECT p.slug,  array_agg(f.code), array_agg(cl.max_allowed_bid)
+FROM capacity_levels cl
+JOIN plans p ON cl.plan_id = p.id
+JOIN facilities f ON cl.facility_id = f.id
+JOIN metros m ON f.metro_id = m.id
+GROUP BY p.slug
+ORDER BY p.slug ASC;
+#+end_src
+
+#+begin_src sql :name checking for distinct prices
+
+SELECT cl.plan_id, cl.max_allowed_bid, COUNT(DISTINCT cl.max_allowed_bid)
+FROM capacity_levels cl
+WHERE cl.deleted_at < 'January 1, 1970'
+GROUP BY plan_id, max_allowed_bid;
+#+end_src
+
+Results [[file:capacity_levels_pricing.csv][capacity_levels_pricing.csv]]
+
+* DONE Upgrade CRDB to 22.2.7
+:PROPERTIES:
+:ARCHIVE_TIME: 2023-05-10 Wed 16:03
+:ARCHIVE_FILE: ~/notes/org-notes/notes.org
+:ARCHIVE_OLPATH: Tasks
+:ARCHIVE_CATEGORY: notes
+:ARCHIVE_TODO: DONE
+:END:
--- a/ruby3-upgrades.org
+++ b/ruby3-upgrades.org
@@ -0,0 +1,76 @@
+#+TITLE: Ruby 3 Upgrades
+#+AUTHOR: Adam Mohammed
+#+DATE: May 10, 2023
+
+
+* Agenda
+- Recap: API deployment architecture
+- Lessons from the Rails 6.0/6.1 upgrade
+- Defining key performance indicators
+
+* Recap: API Deployment
+
+The API deployment consists of:
+- **frontend pods** - 10 pods dedicated to serving HTTP traffic
+- **worker pods** - 8 pods dedicated to job processing
+- **cron jobs** - various rake tasks executed to perform periodic upkeep necessary for the API context
+
+** Release Candidate Deployment Strategy
+
+This is a form of canary deployment. The strategy involves
+diverting just a small amount of traffic to the new version, while looking
+for an increased error rate. After some time, we assess how the
+candidate has been performing. If things look bad, then we scale back
+and address the issues. Otherwise we ramp up the amount of traffic
+that the pods see.
+
+Doing things this way allows us to build confidence in the release but
+it does not come without drawbacks. The most important thing to be
+aware of is that we're relying on the k8s service to load balance
+between the two versions of the application. That means that we're not
+doing any tricks to make sure that a customer is only ever hitting a
+single app version.
+
+We accept this risk because issues with HTTP requests are mostly
+confined to the request, and each span is stamped with the Rails
+version that processed that portion of the request.
+
+Some HTTP requests are not fully completed within the
+request/response cycle. For these endpoints, we queue up background
+jobs that the workers eventually process. This means that some
+requests will be processed by the release candidate, and the
+background job will be processed by the older application version.
+
+Because of this, when using this release strategy, we're assuming that
+the two versions are compatible, and can run side-by-side.
+
+
+* Lessons from Previous Rails Upgrades
+
+We have telemetry set up to monitor the system as a whole, so
+identifying whether or not something looks like an issue related to
+the upgrade or is unrelated has been left to SMEs' intuition.
+
+In the Rails 5.2 -> 6.0 upgrade we hit a couple of issues:
+- Jobs enqueued by Rails 6 could not be processed by Rails 5 workers
+  - We addressed this before rolling forward
+- The prometheus-client upgrade meant that all the cron jobs succeeded but
+  failed to report their status.
+
+In the Rails 6.1 upgrade we observed a new issue: users were
+seeing 404s through the portal after hitting the =/organizations=
+endpoint.
+- I decided that the scope of the bug was small enough that we were
+  okay to roll forward.
+- Error rates looked largely the same because the symptom that we
+  observed was an increased number of 403s on the Projects Controller
+
+
+* Defining key performance indicators
+
+Typically, what I would do (and what I assume Lucas does) is just keep
+an eye on Rollbar. Rollbar captures at least the things that are
+fundamentally broken and raise exceptions or errors in
+Rails. Additionally, I would keep a broad view of errors by span kind
+in Honeycomb to see whether there was a spike associated with the
+release candidate.
Author	SHA1	Message	Date
Adam Mohammed	35975e7fda	Update notes for 5/10	2023-05-10 16:06:51 -04:00
Adam Mohammed	7e576eb71a	Ruby upgradessss	2023-05-10 16:03:37 -04:00