Cleaning up directory

2024-04-20 10:21:42 -04:00
parent d6afd9f472
commit b4f4565894
17 changed files with 0 additions and 273 deletions


@@ -0,0 +1,64 @@
#+TITLE: Ruby 3 Upgrades
#+AUTHOR: Adam Mohammed
#+DATE: May 10, 2023
* Agenda
- Recap: API deployment architecture
- Lessons from the Rails 6.0/6.1 upgrade
- Defining key performance indicators
* Recap: API Deployment
The API deployment consists of:
- **frontend pods** - 10 Pods dedicated to serving HTTP traffic
- **worker pods** - 8 pods dedicated to job processing
- **cron jobs** - various rake tasks executed to perform periodic upkeep necessary for the API context
** Release Candidate Deployment Strategy
This is a form of canary deployment. The strategy involves diverting
just a small amount of traffic to the new version while watching for
an increased error rate. After some time, we assess how the candidate
has been performing. If things look bad, we scale back and address the
issues. Otherwise, we ramp up the amount of traffic that the new pods
see.
Doing things this way allows us to build confidence in the release,
but it does not come without drawbacks. The most important thing to be
aware of is that we're relying on the k8s service to load balance
between the two versions of the application. That means we're not
doing any tricks to make sure that a customer only ever hits a single
app version.
We accept this risk because issues with HTTP requests are mostly
confined to the request itself, and each span stamps the Rails version
that processed that portion of the request.
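As a rough sketch of how that stamping could be wired up, assuming the
app emits traces with the OpenTelemetry Ruby SDK; the middleware and
the =app.rails_version= attribute name are illustrative, not the
actual instrumentation:
#+begin_src ruby
require "opentelemetry/sdk"

# Illustrative Rack middleware: tag the active span for each request with
# the running Rails version so traces can be filtered by app version.
class RailsVersionStamp
  def initialize(app)
    @app = app
  end

  def call(env)
    OpenTelemetry::Trace.current_span.set_attribute("app.rails_version", Rails.version)
    @app.call(env)
  end
end
#+end_src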
Some HTTP requests are not fully completed within the
request/response cycle. For these endpoints, we queue up background
jobs that the workers eventually process. This means that some
requests will be processed by the release candidate, while their
background jobs are processed by the older application version.
Because of this, when using this release strategy, we're assuming that
the two versions are compatible, and can run side-by-side.
* Lessons from Previous Rails Upgrades
* Defining key performance indicators
Typically, what I would do (and what I assume Lucas does) is just
keep an eye on Rollbar. Rollbar captures anything that is
fundamentally broken enough to cause exceptions or errors in
Rails. Additionally, I would keep a broad view on errors by span kind
in Honeycomb to see if we were seeing a spike associated with the
release candidate.
- What we were looking at in the previous releases
- Error rates by span kind per version
This helps us know if the error rate for requests is higher in one version or the other, or if we're failing specifically in processing background jobs.
- No surprises in Rollbar
Instead, ideally we'd be tracking stable signals that the system itself reports.


@@ -0,0 +1,23 @@
#+TITLE: Scalable API
#+AUTHOR: Adam Mohammed
* Overview
In this document we take a look at the concept of breaking the
monolith from the start. By that I mean: what do we hope to achieve by
breaking the monolith? From there we can identify the problems we're
trying to solve.
Part of the problem I have with the "breaking the monolith" phrase is
that the vision is too lofty. That phrase isn't a vision; it's snake
oil. The promised land we hope to reach is a place where teams can
focus on delivering business value and new features to customers that
are meaningful, leverage our existing differentiators, and enable new
ones.
What do we believe is currently preventing us from delivering
business value quickly? Whatever we identify there is a hypothesis
based on some level of intuition, so it's a great starting point for
an attempt to optimize the process. It's even better if we can
quantify how much effort is spent on these speed-inhibiting
activities, so we know we're optimizing our bottlenecks.


@@ -0,0 +1,166 @@
#+TITLE: Session Scheduler
* Overview
For some API requests, the time it would take to serve the request is
too long for a typical HTTP call. We use ActiveJob from Rails to
handle these types of background jobs. Typically, instead of servicing
the whole request before responding back to the client, we'll just
create a new job and then immediately return.
Sometimes we have jobs that need to be processed in a specific order,
and this is where the session scheduler comes in. It manages a number
of queues for these workloads and assigns each job to a queue
dynamically. This document talks about what kinds of problems the
scheduler is meant for, how it is implemented, and how you can use it.
* Ordering problems
Often there are ordering constraints between these background
jobs. In some networking APIs, for example, things must happen in a
particular order to achieve the desired state.
The simplest example of this is assigning a VLAN to a port and then
unassigning it. You can make these API calls in quick succession, but
it may take some time for the actual state of the switch to be
updated. If these jobs are processed in parallel, the order in which
they finish determines the final state of the port.
If the unassign finishes first, then the final state the user will see
is that the port is assigned to the VLAN. Otherwise, it'll end up in
the state without a VLAN assigned.
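As a concrete sketch of that race (the job classes below are purely
illustrative, not the actual API jobs):
#+begin_src ruby
# Hypothetical jobs, only to illustrate the race between assign and unassign.
class AssignVlanJob < ApplicationJob
  queue_as :network

  def perform(port, vlan)
    # ask the switch to add the port to the VLAN (slow, eventually consistent)
  end
end

class UnassignVlanJob < ApplicationJob
  queue_as :network

  def perform(port, vlan)
    # ask the switch to remove the port from the VLAN
  end
end

# Enqueued back-to-back in the order the customer asked for:
#   AssignVlanJob.perform_later(port, vlan)
#   UnassignVlanJob.perform_later(port, vlan)
# With parallel workers there is no guarantee which job finishes first, so
# the port's final VLAN state depends on scheduling, not on request order.
#+end_src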
The best we can do here is make the assumption that we get the
requests in the order that the customer wanted operations to occur
in. So, if the assign came in first, we must finish that job before
processing the unassign.
Our API workers that serve the background jobs currently fetch and
process jobs as fast as they can with no respect to ordering. When
ordering is not important, this method works to process jobs quickly.
With our networking example though, it leads to behavior that's hard
to predict on the customer's end.
* Constraints
We have a few constraints for creating a solution to the ordering
problem, using the VLANs as an example:
- Order of jobs must be respected within a project, but total ordering
is not important (e.g. Project A's tasks don't need to be ordered
with respect to Project B's tasks)
- Dynamically spinning up consumers and queues isn't the most fun thing
in Ruby, but having access to the monolith data is required at this
point in time.
- We need a way to map an arbitrary number of projects down to a fixed set of
consumers.
- Although total ordering doesn't matter, we do want to be somewhat
fair.
Let's clarify some terms:
- Total Ordering - All events occur in a specific order (A1 -> B1 ->
A2 -> C1 -> B2 -> C2 -> B3)
- Partial ordering - Some events must occur before others, but
unrelated events may interleave freely (e.g. A1 must occur before A2,
which must occur before A3, but [A1, A2, A3] has no relation to B1).
- Correctness - Job ordering constraints are honored.
- Fairness - If there are jobs A1, A2, ..., An and jobs B1, B2, ..., Bn,
both sets are able to get serviced in some reasonable amount of time.
* Session scheduler
** Queueing and Processing Jobs In Order
For some requests in the Metal API, we aren't able to fully service
the request within the span of an HTTP request/response. Some things
might take several seconds to minutes to complete. We rely on Rails
Active Job to handle these things as background jobs. ActiveJob lets
us specify a queue name, which, until now, has been a static name such
as "network".
The API runs a number of workers that are listening on these queues
with multiple threads, so we can pick up and service the jobs quickly.
This breaks down when we require some jobs to be processed serially or
in a specific order. This is where the =Worker::SessionScheduler=
comes in. This scheduler dynamically assigns the queue name for a job
so that it is processed in order with other related jobs.
A typical Rails job looks something like this:
#+begin_src ruby
class MyJob < ApplicationJob #1
  queue_as :network #2

  def perform #3
    # do stuff
  end
end
#+end_src
1. We can tell the name of the job is =MyJob=
2. Shows the queue that the job will wait in before getting picked up
3. =perform= is the work that the consumer picking up the job will do
Typically, we'll queue a job to be performed later within the span of
an HTTP request by doing something like =MyJob.perform_later=. This
puts the job on the =network= queue, and the next available worker
will pull the job off of the queue and then process it.
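For instance, a hypothetical controller action (the controller shown
here is illustrative, not an actual endpoint) might enqueue the job
and return right away:
#+begin_src ruby
class PortsController < ApplicationController
  def update
    # Kick off the slow work in the background and respond immediately.
    MyJob.perform_later
    head :accepted
  end
end
#+end_src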
In the case where we need jobs to be processed in a certain order, it
might look like this:
#+begin_src ruby
class MyJob < ApplicationJob
  queue_as do
    project = self.arguments.first #2
    Worker::SessionScheduler.call(session_key: project.id)
  end

  def perform(project)
    # do stuff
  end
end
#+end_src
Now, instead of =2= being a static queue name, the queue is assigned
dynamically by the scheduler.
The scheduler uses the "session key" to check whether any other jobs
are queued with the same key; if there are, the job is sent to the
same queue.
If there aren't, the job is sent to the queue with the fewest jobs
waiting to be processed, and any subsequent jobs with the same
"session key" will follow it.
Just putting jobs in the same queue isn't enough, though: if we
process the jobs from a queue in parallel, we can still end up with
jobs completing out of order. We have queues designated for processing
things in order. We're currently leveraging a feature of RabbitMQ
queues that guarantees only one consumer ever gets the jobs to
process, and we rely on that consumer's configuration to use only a
single thread, so we're not doing things out of order.
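If the feature in question is RabbitMQ's single-active-consumer queue
argument (an assumption on my part), declaring such a queue might look
roughly like this, shown with the bunny gem purely for illustration;
the workers' actual queue setup may differ:
#+begin_src ruby
require "bunny"

connection = Bunny.new.start
channel = connection.create_channel

# Only one consumer at a time receives messages from this queue, so its jobs
# are handed out strictly in order even if several workers subscribe to it.
channel.queue(
  "ordered_1",
  durable: true,
  arguments: { "x-single-active-consumer" => true }
)
#+end_src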
This can be used for any set of jobs that need to be ordered, though
currently we're only using it for port VLAN management. If you decide
to use it, make sure that all of the related jobs share some attribute
you can use as your "session key" when calling into the scheduling
service.
The scheduler takes care of the details of managing the queues: once
all the jobs for a session are completed, that session is removed, and
the next time the same key comes in it gets reallocated to the best
worker. This allows us to rebalance the queues over time and prevents
customers from seeing longer wait times even though we're doing things
serially.