
#+TITLE: Session Scheduler

* Overview

For some API requests, serving the request takes too long for a
typical HTTP call. We use ActiveJob from Rails to handle these types
of background jobs. Instead of servicing the whole request before
responding to the client, we typically create a new job and return
immediately.

Sometimes jobs need to be processed in a specific order, and this is
where the session scheduler comes in. It manages a number of queues
for workloads and dynamically assigns each job to one of those queues.

This document covers the kinds of problems the scheduler is meant for,
how it is implemented, and how you can use it.

* Ordering problems

Background jobs often have ordering constraints between them. In some
networking APIs, for example, things must happen in a particular order
to achieve the desired state.

The simplest example of this is assigning and unassigning a VLAN to a
port. You can make these calls to the API in quick succession, but it
may take some time for the actual state of the switch to be updated.
If these jobs are processed in parallel, the order in which they
finish determines the final state of the port.

If the unassign finishes first, the final state the user sees is a
port assigned to the VLAN. Otherwise, the port ends up without a VLAN
assigned.

The best we can do here is make the assumption that we get the
requests in the order that the customer wanted operations to occur
in. So, if the assign came in first, we must finish that job before
processing the unassign.

Our API workers that serve the background jobs currently fetch and
process jobs as fast as they can, with no regard for ordering. When
ordering is not important, this approach processes jobs quickly.

With our networking example though, it leads to behavior that's hard
to predict on the customer's end.
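
The VLAN race can be modelled in a few lines of plain Ruby. This is
only a sketch: the =Job= struct, the durations and the VLAN id are made
up for illustration and are not part of the API.

#+begin_src ruby
  # Hypothetical model: each job has a duration and the VLAN it leaves
  # on the port (nil means "no VLAN assigned").
  Job = Struct.new(:name, :duration, :vlan_after)

  # All jobs start at once, so they take effect in order of completion.
  def final_vlan_parallel(jobs)
    jobs.sort_by(&:duration).reduce(nil) { |_state, job| job.vlan_after }
  end

  # Jobs run one at a time, in the order the requests arrived.
  def final_vlan_serial(jobs)
    jobs.reduce(nil) { |_state, job| job.vlan_after }
  end

  assign   = Job.new("assign", 5.0, 100)    # slow job
  unassign = Job.new("unassign", 1.0, nil)  # fast job
  requests = [assign, unassign]             # order the customer sent them

  parallel_result = final_vlan_parallel(requests) # => 100, port still assigned
  serial_result   = final_vlan_serial(requests)   # => nil, port unassigned
#+end_src

Only the serial result matches what the customer asked for: assign,
then unassign.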

* Constraints

We have a few constraints on any solution to the ordering problem,
using VLANs as the example:

- The order of jobs must be respected within a project, but total
  ordering is not important (e.g. Project A's tasks don't need to be
  ordered with respect to Project B's tasks).
- Dynamically spinning up consumers and queues isn't the most fun
  thing in Ruby, but having access to the monolith data is required at
  this point in time.
- We need a way to map an arbitrary number of projects down to a fixed
  set of consumers.
- Although total ordering doesn't matter, we do want to be somewhat
  fair.


Let's clarify some terms:

- Total ordering - All events occur in a specific order (A1 -> B1 ->
  A2 -> C1 -> B2 -> C2 -> B3).
- Partial ordering - Some events must occur before others, but the
  combinations are otherwise free (e.g. A1 must occur before A2, which
  must occur before A3, but [A1, A2, A3] has no relation to B1).
- Correctness - Job ordering constraints are honored.
- Fairness - If there are jobs A1, A2, ..., An and jobs B1, B2, ...,
  Bn, both sets get serviced in some reasonable amount of time.
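
To make the partial ordering requirement concrete, here is a small
check written for this document. It assumes a job can be represented
as a =[project, sequence]= pair, which is an illustrative encoding
rather than anything the API uses.

#+begin_src ruby
  # Returns true when each project's jobs appear in `completed` in the
  # same relative order as in `submitted`. Ordering between different
  # projects is ignored, which is what partial ordering allows.
  def partial_order_respected?(submitted, completed)
    by_project = ->(jobs) { jobs.group_by(&:first) }
    by_project.call(submitted) == by_project.call(completed)
  end

  submitted = [["A", 1], ["B", 1], ["A", 2], ["B", 2]]

  # Projects interleave differently, but each project stays in order.
  interleaved = [["B", 1], ["A", 1], ["B", 2], ["A", 2]]
  # A2 completes before A1, which violates the constraint.
  swapped = [["A", 2], ["B", 1], ["A", 1], ["B", 2]]

  interleaved_ok = partial_order_respected?(submitted, interleaved) # => true
  swapped_ok     = partial_order_respected?(submitted, swapped)     # => false
#+end_src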


* Session scheduler


** Queueing and Processing Jobs In Order

For some requests in the Metal API, we can't fully service the request
within the span of an HTTP request/response cycle. Some things take
several seconds to minutes to complete. We rely on Rails ActiveJob to
run this work as background jobs. ActiveJob lets us specify a queue
name, which until now has been a static name such as "network".

The API runs a number of workers that are listening on these queues
with multiple threads, so we can pick up and service the jobs quickly.

This breaks down when we require some jobs to be processed serially or
in a specific order. This is where the =Worker::SessionScheduler=
comes in. This scheduler dynamically assigns the queue name for a job
so that it is processed in order with other related jobs.


A typical Rails job looks something like this:

#+begin_src ruby
  class MyJob < ApplicationJob #1
    queue_as :network #2

    def perform #3
      # do stuff
    end
  end
#+end_src


1. The class name, =MyJob=, is the name of the job.
2. =queue_as= names the queue the job waits in before being picked up.
3. =perform= is the work done by the consumer that picks up the job.

Typically, we'll queue a job to be performed later within the span of
an HTTP request by doing something like =MyJob.perform_later=. This
puts the job on the =network= queue, and the next available worker
will pull the job off of the queue and then process it.

In the case where we need jobs to be processed in a certain order it
might look like this:

#+begin_src ruby
  class MyJob < ApplicationJob
    queue_as do
      project = self.arguments.first #2
      Worker::SessionScheduler.call(session_key: project.id)
    end

    def perform(project)
      # do stuff
    end
  end
#+end_src

Now the queue name, which was a static value at =2= in the first
example, is chosen dynamically by the scheduler.

The scheduler uses the "session key" to check whether any other jobs
are queued with the same key. If there are, the new job is sent to the
same queue.

If there aren't, the job is sent to the queue with the fewest jobs
waiting to be processed, and any subsequent jobs with the same
"session key" will follow it there.
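
The assignment rule can be sketched with an in-memory stand-in. This is
not the real =Worker::SessionScheduler=, whose implementation is not
shown here; the class name =SketchScheduler=, the queue names and the
in-process counters are all assumptions made for illustration.

#+begin_src ruby
  # Hypothetical in-memory stand-in for the scheduler's assignment rule.
  class SketchScheduler
    def initialize(queue_names)
      @depth    = queue_names.to_h { |name| [name, 0] } # jobs waiting per queue
      @sessions = {}                                    # session key => queue name
    end

    # Returns the queue for this session key and counts the job as waiting.
    # A known key sticks to its queue; a new key gets the emptiest queue.
    def call(session_key:)
      queue = @sessions[session_key] ||= least_loaded_queue
      @depth[queue] += 1
      queue
    end

    private

    def least_loaded_queue
      @depth.min_by { |_name, depth| depth }.first
    end
  end

  scheduler = SketchScheduler.new(%w[ordered_0 ordered_1])
  first  = scheduler.call(session_key: "project-a")
  second = scheduler.call(session_key: "project-a") # same queue as `first`
  third  = scheduler.call(session_key: "project-b") # the other, emptier queue
#+end_src

In this sketch both jobs for =project-a= land on one queue, while
=project-b= is placed on the queue with nothing waiting.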

Just putting jobs in the same queue isn't enough, though. If we
process the jobs from a queue in parallel, jobs can still complete out
of order. We have queues designated for processing things in order. We
currently leverage a RabbitMQ queue feature that guarantees only one
consumer ever receives the jobs from a queue. We also rely on that
consumer being configured to use a single thread, so that nothing runs
out of order.
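
The effect of a single consumer running a single thread can be shown
with Ruby's built-in =Queue=. This is a plain-Ruby illustration only:
it does not use RabbitMQ or the real consumer configuration, and the
job hashes are made up.

#+begin_src ruby
  # Thread::Queue stands in for a broker queue in this sketch.
  queue     = Queue.new
  completed = []

  # One consumer, one thread: each job finishes before the next starts,
  # so completion order equals queue order even when durations differ.
  consumer = Thread.new do
    while (job = queue.pop) != :stop
      sleep(job[:duration]) # simulate work
      completed << job[:name]
    end
  end

  queue << { name: "assign",   duration: 0.05 } # slow job, queued first
  queue << { name: "unassign", duration: 0.01 } # fast job, queued second
  queue << :stop
  consumer.join

  completed # => ["assign", "unassign"]
#+end_src

The trade-off is that a slow job delays everything queued behind it.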

This can be used for any set of jobs that need to be ordered, though
currently we use it only for Port VLAN management. If you decide to
use it, make sure that all related jobs share some attribute that you
can use as the "session key" when calling into the scheduling service.

The scheduler takes care of the details of managing the queues. Once
all the jobs for a session are completed, that session is removed, and
the next time the same key comes in it is reallocated to the best
worker. This lets us rebalance the queues over time, so customers
don't face longer wait times even though we process things serially.
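
The session lifecycle can be sketched as follows. =SessionTable= and
its methods are hypothetical names used only for this illustration, and
the bookkeeping lives in memory here, which may differ from how the
real scheduler tracks sessions.

#+begin_src ruby
  # Hypothetical bookkeeping for session release and reallocation.
  class SessionTable
    def initialize(queue_names)
      @queue_names = queue_names
      @sessions = {} # session key => { queue:, pending: }
    end

    # Pins a new key to the least-loaded queue and counts the job.
    def enqueue(key)
      session = @sessions[key] ||= { queue: least_loaded_queue, pending: 0 }
      session[:pending] += 1
      session[:queue]
    end

    # Marks one job done and drops the session once nothing is pending.
    def complete(key)
      session = @sessions.fetch(key)
      session[:pending] -= 1
      @sessions.delete(key) if session[:pending].zero?
    end

    def active?(key)
      @sessions.key?(key)
    end

    private

    def least_loaded_queue
      @queue_names.min_by { |name| pending_on(name) }
    end

    def pending_on(name)
      @sessions.each_value.select { |s| s[:queue] == name }.sum { |s| s[:pending] }
    end
  end

  table   = SessionTable.new(%w[ordered_0 ordered_1])
  a_queue = table.enqueue("project-a")      # one job for project-a
  b_queue = table.enqueue("project-b")      # lands on the other queue
  table.enqueue("project-b")
  3.times { table.enqueue("project-c") }    # joins project-a's queue

  table.complete("project-a")               # last pending job for project-a
  released = !table.active?("project-a")    # session has been removed

  a_again = table.enqueue("project-a")      # reallocated by current load
#+end_src

When =project-a= returns, its old queue is the busier one, so in this
sketch it is placed on the queue that =project-b= uses instead.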