Cleaning up directory

2024-04-20 10:21:42 -04:00
parent d6afd9f472
commit b4f4565894
17 changed files with 0 additions and 273 deletions


@@ -0,0 +1,64 @@
#+TITLE: Ruby 3 Upgrades
#+AUTHOR: Adam Mohammed
#+DATE: May 10, 2023
* Agenda
- Recap: API deployment architecture
- Lessons from the Rails 6.0/6.1 upgrade
- Defining key performance indicators
* Recap: API Deployment
The API deployment consists of:
- **frontend pods** - 10 Pods dedicated to serving HTTP traffic
- **worker pods** - 8 pods dedicated to job processing
- **cron jobs** - various rake tasks executed to perform periodic upkeep necessary for the API context
** Release Candidate Deployment Strategy
This is a form of canary deployment. The strategy involves diverting
just a small amount of traffic to the new version while watching for
an increased error rate. After some time, we assess how the candidate
has been performing. If things look bad, we scale back and address the
issues. Otherwise, we ramp up the amount of traffic that the new pods
see.
Doing things this way allows us to build confidence in the release,
but it does not come without drawbacks. The most important thing to be
aware of is that we're relying on the k8s service to load balance
between the two versions of the application. That means we're not
doing any tricks to make sure that a customer only ever hits a single
app version.
We accept this risk because issues with HTTP requests are mostly
confined to the request itself, and each span stamps the Rails version
that processed that portion of the request.
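As a rough sketch of how that stamping could be wired up, assuming the
app emits traces with the OpenTelemetry Ruby SDK; the middleware and
the =app.rails_version= attribute name are illustrative, not the
actual instrumentation:
#+begin_src ruby
require "opentelemetry/sdk"

# Illustrative Rack middleware: tag the active span for each request with
# the running Rails version so traces can be filtered by app version.
class RailsVersionStamp
  def initialize(app)
    @app = app
  end

  def call(env)
    OpenTelemetry::Trace.current_span.set_attribute("app.rails_version", Rails.version)
    @app.call(env)
  end
end
#+end_src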
Some HTTP requests are not fully completed within the
request/response cycle. For these endpoints, we queue up background
jobs that the workers eventually process. This means that some
requests will be processed by the release candidate, while their
background jobs are processed by the older application version.
Because of this, when using this release strategy, we're assuming that
the two versions are compatible, and can run side-by-side.
* Lessons from Previous Rails Upgrades
* Defining key performance indicators
Typically, what I would do (and what I assume Lucas does) is just
keep an eye on Rollbar. Rollbar captures anything that is
fundamentally broken enough to cause exceptions or errors in
Rails. Additionally, I would keep a broad view on errors by span kind
in Honeycomb to see if we were seeing a spike associated with the
release candidate.
- What we were looking at in the previous releases
- Error rates by span kind per version
This helps us know if the error rate for requests is higher in one version or the other, or if we're failing specifically in processing background jobs.
- No surprises in Rollbar
Instead, ideally we'd be tracking stable signals that the system itself reports.


@@ -0,0 +1,23 @@
#+TITLE: Scalable API
#+AUTHOR: Adam Mohammed
* Overview
In this document we take a look at the concept of breaking the
monolith from the start. By that I mean: what do we hope to achieve by
breaking the monolith? From there we can identify the problems we're
trying to solve.
Part of the problem I have with the "breaking the monolith" phrase is
that the vision is too lofty. That phrase isn't a vision; it's snake
oil. The promised land we hope to reach is a place where teams can
focus on delivering business value and new features to customers that
are meaningful, leverage our existing differentiators, and enable new
ones.
What do we believe is currently preventing us from delivering
business value quickly? Whatever we identify there is a hypothesis
based on some level of intuition, so it's a great starting point for
an attempt to optimize the process. It's even better if we can
quantify how much effort is spent on these speed-inhibiting
activities, so we know we're optimizing our bottlenecks.


@@ -0,0 +1,166 @@
#+TITLE: Session Scheduler
* Overview
For some API requests, the time it would take to serve the request is
too long for a typical HTTP call. We use ActiveJob from Rails to
handle these types of background jobs. Typically, instead of servicing
the whole request before responding back to the client, we'll just
create a new job and then immediately return.
Sometimes we have jobs that need to be processed in a specific order,
and this is where the session scheduler comes in. It manages a number
of queues for these workloads and assigns each job to a queue
dynamically. This document talks about what kinds of problems the
scheduler is meant for, how it is implemented, and how you can use it.
* Ordering problems
Often there are ordering constraints between these background
jobs. In some networking APIs, for example, things must happen in a
particular order to achieve the desired state.
The simplest example of this is assigning a VLAN to a port and then
unassigning it. You can make these API calls in quick succession, but
it may take some time for the actual state of the switch to be
updated. If these jobs are processed in parallel, the order in which
they finish determines the final state of the port.
If the unassign finishes first, then the final state the user will see
is that the port is assigned to the VLAN. Otherwise, it'll end up in
the state without a VLAN assigned.
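As a concrete sketch of that race (the job classes below are purely
illustrative, not the actual API jobs):
#+begin_src ruby
# Hypothetical jobs, only to illustrate the race between assign and unassign.
class AssignVlanJob < ApplicationJob
  queue_as :network

  def perform(port, vlan)
    # ask the switch to add the port to the VLAN (slow, eventually consistent)
  end
end

class UnassignVlanJob < ApplicationJob
  queue_as :network

  def perform(port, vlan)
    # ask the switch to remove the port from the VLAN
  end
end

# Enqueued back-to-back in the order the customer asked for:
#   AssignVlanJob.perform_later(port, vlan)
#   UnassignVlanJob.perform_later(port, vlan)
# With parallel workers there is no guarantee which job finishes first, so
# the port's final VLAN state depends on scheduling, not on request order.
#+end_src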
The best we can do here is make the assumption that we get the
requests in the order that the customer wanted operations to occur
in. So, if the assign came in first, we must finish that job before
processing the unassign.
Our API workers that serve the background jobs currently fetch and
process jobs as fast as they can with no respect to ordering. When
ordering is not important, this method works to process jobs quickly.
With our networking example though, it leads to behavior that's hard
to predict on the customer's end.
* Constraints
We have a few constraints for creating a solution to the ordering
problem, using the VLANs as an example:
- Order of jobs must be respected within a project, but total ordering
is not important (e.g. Project A's tasks don't need to be ordered
with respect to Project B's tasks)
- Dynamically spinning up consumers and queues isn't the most fun thing
in Ruby, but having access to the monolith data is required at this
point in time.
- We need a way to map an arbitrary number of projects down to a fixed set of
consumers.
- Although total ordering doesn't matter, we do want to be somewhat
fair.
Let's clarify some terms:
- Total Ordering - All events occur in a specific order (A1 -> B1 ->
A2 -> C1 -> B2 -> C2 -> B3)
- Partial ordering - Some events must occur before others, but
unrelated events may interleave freely (e.g. A1 must occur before A2,
which must occur before A3, but [A1, A2, A3] has no relation to B1).
- Correctness - Job ordering constraints are honored.
- Fairness - If there are jobs A1, A2, ..., An and jobs B1, B2, ..., Bn,
both sets are able to get serviced in some reasonable amount of time.
* Session scheduler
** Queueing and Processing Jobs In Order
For some requests in the Metal API, we aren't able to fully service
the request within the span of an HTTP request/response. Some things
might take several seconds to minutes to complete. We rely on Rails
Active Job to handle these things as background jobs. ActiveJob lets
us specify a queue name, which, until now, has been a static name such
as "network".
The API runs a number of workers that are listening on these queues
with multiple threads, so we can pick up and service the jobs quickly.
This breaks down when we require some jobs to be processed serially or
in a specific order. This is where the =Worker::SessionScheduler=
comes in. This scheduler dynamically assigns the queue name for a job
so that it is processed in order with other related jobs.
A typical Rails job looks something like this:
#+begin_src ruby
class MyJob < ApplicationJob #1
  queue_as :network #2

  def perform #3
    # do stuff
  end
end
#+end_src
1. We can tell the name of the job is =MyJob=
2. Shows the queue that the job will wait in before getting picked up
3. =perform= is the work that the consumer picking up the job will do
Typically, we'll queue a job to be performed later within the span of
an HTTP request by doing something like =MyJob.perform_later=. This
puts the job on the =network= queue, and the next available worker
will pull the job off of the queue and then process it.
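For instance, a hypothetical controller action (the controller shown
here is illustrative, not an actual endpoint) might enqueue the job
and return right away:
#+begin_src ruby
class PortsController < ApplicationController
  def update
    # Kick off the slow work in the background and respond immediately.
    MyJob.perform_later
    head :accepted
  end
end
#+end_src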
In the case where we need jobs to be processed in a certain order, it
might look like this:
#+begin_src ruby
class MyJob < ApplicationJob
  queue_as do
    project = self.arguments.first #2
    Worker::SessionScheduler.call(session_key: project.id)
  end

  def perform(project)
    # do stuff
  end
end
#+end_src
Now, instead of =2= being a static queue name, the queue is assigned
dynamically by the scheduler.
The scheduler uses the "session key" to check whether any other jobs
are queued with the same key; if there are, the job is sent to the
same queue.
If there aren't, the job is sent to the queue with the fewest jobs
waiting to be processed, and any subsequent jobs with the same
"session key" will follow it.
Just putting jobs in the same queue isn't enough, though: if we
process the jobs from a queue in parallel, we can still end up with
jobs completing out of order. We have queues designated for processing
things in order. We're currently leveraging a feature of RabbitMQ
queues that guarantees only one consumer ever gets the jobs to
process, and we rely on that consumer's configuration to use only a
single thread, so we're not doing things out of order.
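If the feature in question is RabbitMQ's single-active-consumer queue
argument (an assumption on my part), declaring such a queue might look
roughly like this, shown with the bunny gem purely for illustration;
the workers' actual queue setup may differ:
#+begin_src ruby
require "bunny"

connection = Bunny.new.start
channel = connection.create_channel

# Only one consumer at a time receives messages from this queue, so its jobs
# are handed out strictly in order even if several workers subscribe to it.
channel.queue(
  "ordered_1",
  durable: true,
  arguments: { "x-single-active-consumer" => true }
)
#+end_src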
This can be used for any set of jobs that need to be ordered, though
currently we're only using it for port VLAN management. If you decide
to use it, make sure that all of the related jobs share some attribute
you can use as your "session key" when calling into the scheduling
service.
The scheduler takes care of the details of managing the queues: once
all the jobs for a session are completed, that session is removed, and
the next time the same key comes in it gets reallocated to the best
worker. This allows us to rebalance the queues over time and prevents
customers from seeing longer wait times even though we're doing things
serially.