#+TITLE: Session Scheduler
* Overview

For some API requests, the time it would take to serve the request is
too long for a typical HTTP call. We use ActiveJob from Rails to
handle these types of background jobs. Typically, instead of servicing
the whole request before responding to the client, we just create a
new job and return immediately.

Sometimes we have jobs that need to be processed in a specific order,
and this is where the session scheduler comes in. It manages a number
of queues for workloads and dynamically assigns each job to one of
those queues.

This document covers the kinds of problems the scheduler is meant
for, how it is implemented, and how you can use it.

* Ordering problems

Background jobs often have ordering constraints between them. In some
networking APIs, for example, things must happen in a particular order
to achieve the desired state.

The simplest example of this is assigning and unassigning a VLAN to a
port. You can make these calls to the API in quick succession, but
it may take some time for the actual state of the switch to be
updated. If these jobs are processed in parallel, the order in which
they finish determines the final state of the port.

If the unassign finishes first, then the final state the user will see
is that the port is assigned to the VLAN. Otherwise, the port ends up
without a VLAN assigned.

The best we can do here is assume that we receive the requests in the
order in which the customer wanted the operations to occur. So, if the
assign came in first, we must finish that job before processing the
unassign.

Our API workers that serve the background jobs currently fetch and
process jobs as fast as they can, with no regard for ordering. When
ordering is not important, this processes jobs quickly.

With our networking example, though, it leads to behavior that is hard
for the customer to predict.

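
To make the race concrete, here is a minimal Ruby sketch. It is not
the API's code: a Hash stands in for the switch port, two hypothetical
lambdas stand in for the assign and unassign jobs, and the VLAN id 100
is made up. Applying the same two jobs in different completion orders
leaves the port in different final states.

#+begin_src ruby
# Hypothetical stand-ins: a Hash plays the switch port, lambdas play the jobs.
assign   = ->(port) { port[:vlan] = 100 }
unassign = ->(port) { port[:vlan] = nil }

# Apply the jobs in the order they *finish* and return the final port state.
def final_state(jobs_in_completion_order)
  port = { vlan: nil }
  jobs_in_completion_order.each { |job| job.call(port) }
  port
end

requested_order = final_state([assign, unassign]) # unassign finishes last
raced_order     = final_state([unassign, assign]) # unassign finishes first

puts "requested order: #{requested_order.inspect}"
puts "raced order:     #{raced_order.inspect}"
#+end_src
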
We have a few constraints on any solution to the ordering problem,
using VLANs as an example:

- Order of jobs must be respected within a project, but total ordering
  is not important (e.g. Project A's tasks don't need to be ordered
  with respect to Project B's tasks).
- Dynamically spinning up consumers and queues isn't the most fun thing
  in Ruby, but having access to the monolith data is required at this
  point in time.
- We need a way to map an arbitrary number of projects down to a fixed
  set of consumers.
- Although total ordering doesn't matter, we do want to be somewhat
  fair.

Let's clarify some terms:

- Total ordering - All events occur in a specific order (A1 -> B1 ->
  A2 -> C1 -> B2 -> C2 -> B3).
- Partial ordering - Some events must occur before others, but the
  combinations are free (e.g. A1 must occur before A2, which must occur
  before A3, but [A1, A2, A3] has no relation to B1).
- Correctness - Jobs' ordering constraints are honored.
- Fairness - If there are jobs A1, A2, ..., An and jobs B1, B2, ..., Bn,
  both sets get serviced in some reasonable amount of time.

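
The difference between total and partial ordering can be checked
mechanically. The sketch below is illustrative only and assumes a
made-up label format (a project letter followed by a sequence number,
as in the examples above). A completion log is "correct" in the sense
defined above if, within each project, the sequence numbers never go
backwards, regardless of how the projects interleave.

#+begin_src ruby
# Hypothetical label format: one project letter, then a sequence number ("A1").
def partially_ordered?(events)
  events
    .map { |label| [label[0], label[1..-1].to_i] } # => [project, sequence]
    .group_by(&:first)                             # one group per project
    .values
    .all? do |pairs|
      sequence = pairs.map(&:last)
      sequence == sequence.sort # in order within the project
    end
end

interleaved = %w[A1 B1 A2 C1 B2 C2 B3] # projects interleave, each stays in order
swapped     = %w[A2 A1 B1]             # A2 completed before A1

puts partially_ordered?(interleaved)
puts partially_ordered?(swapped)
#+end_src
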
* Session scheduler

** Queueing and Processing Jobs In Order

For some requests in the Metal API, we aren't able to fully service
the request within the span of an HTTP request/response cycle. Some
things might take several seconds to minutes to complete. We rely on
Rails ActiveJob to run these as background jobs. ActiveJob lets us
specify a queue name, which until now has been a static name such as
"network".

The API runs a number of workers that listen on these queues with
multiple threads, so we can pick up and service the jobs quickly.

This breaks down when we require some jobs to be processed serially or
in a specific order. This is where the =Worker::SessionScheduler=
comes in. This scheduler dynamically assigns the queue name for a job
so that it is processed in order with other related jobs.

A typical Rails job looks something like this:

#+begin_src ruby
class MyJob < ApplicationJob # 1
  queue_as :network # 2

  def perform # 3
    # do stuff
  end
end
#+end_src

1. The class name tells us the job is called =MyJob=.
2. =queue_as= names the queue that the job waits in before getting
   picked up.
3. =perform= is the work that the consumer that picks up the job will
   do.

Typically, we queue a job to be performed later within the span of
an HTTP request by calling something like =MyJob.perform_later=. This
puts the job on the =network= queue, and the next available worker
pulls the job off the queue and processes it.

When we need jobs to be processed in a certain order, the job might
look like this:

#+begin_src ruby
class MyJob < ApplicationJob
  queue_as do # 2
    # The first job argument is the project; its id is the session key.
    project = self.arguments.first
    Worker::SessionScheduler.call(session_key: project.id)
  end

  def perform(project)
    # do stuff
  end
end
#+end_src

Now, instead of =2= being just a static queue name, the queue name is
whatever the scheduler assigns.

The scheduler uses the "session key" to see if there are any other
jobs queued with the same key. If there are, the job is sent to the
same queue.

If there aren't, the job is sent to the queue with the fewest jobs
waiting to be processed, and any subsequent jobs with the same
"session key" follow it.

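
The assignment rule described above can be sketched in a few lines of
plain Ruby. This is an in-memory illustration under stated
assumptions, not the actual =Worker::SessionScheduler=: the class name
=SketchScheduler=, the queue names, and the way queue depth is tracked
are all hypothetical. Only the =call(session_key:)= shape is taken
from the example above.

#+begin_src ruby
# Hypothetical in-memory model; the real Worker::SessionScheduler may differ.
class SketchScheduler
  def initialize(queue_names)
    @depth    = queue_names.to_h { |name| [name, 0] } # queue name => waiting jobs
    @sessions = {}                                     # session key => queue name
  end

  # Return the queue name for a job with the given session key.
  def call(session_key:)
    # Reuse the session's queue if it has one, else pick the least loaded queue.
    queue = @sessions[session_key] ||= least_loaded_queue
    @depth[queue] += 1
    queue
  end

  private

  def least_loaded_queue
    @depth.min_by { |_name, waiting| waiting }.first
  end
end

scheduler = SketchScheduler.new(%w[ordered_1 ordered_2])
first  = scheduler.call(session_key: "project-a") # new session
second = scheduler.call(session_key: "project-b") # new session, other queue is emptier
third  = scheduler.call(session_key: "project-a") # follows the existing session

puts [first, second, third].inspect
#+end_src

Keeping the session-to-queue mapping separate from the queue depths
is one way to map an arbitrary number of session keys onto a fixed set
of queues.
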
Just putting jobs in the same queue isn't enough, though: if we
process the jobs from a queue in parallel, jobs can still complete
out of order. We have queues designated for processing things in
order. We currently leverage a feature of RabbitMQ queues that
guarantees that only one consumer ever receives jobs from the queue.
We also rely on that consumer being configured to use a single
thread, so that jobs are not processed out of order.

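
The effect of "one consumer, one thread" can be illustrated without a
broker. In the sketch below, Ruby's built-in =Queue= stands in for a
single ordered queue and one =Thread= stands in for its only consumer;
none of this is RabbitMQ or ActiveJob code, and the job names are made
up. Because only one thread pops from the queue, each job finishes
before the next one starts.

#+begin_src ruby
# Stand-in for one ordered queue; Queue is Ruby's built-in thread-safe FIFO.
jobs = Queue.new
%w[assign unassign assign].each { |job| jobs << job }
jobs.close # no more jobs; pop returns nil once the queue is drained

processed = []

# Exactly one consumer thread, so jobs are handled strictly one at a time.
consumer = Thread.new do
  while (job = jobs.pop)
    processed << job
  end
end
consumer.join

puts processed.inspect
#+end_src
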
This can be used for any set of jobs that need to be ordered, though
currently we only use it for Port VLAN management. If you decide to
use it, make sure that all the related jobs share some attribute that
you can use as the "session key" when calling into the scheduling
service.

The scheduler takes care of the details of managing the queues. Once
all the jobs for a session are completed, that session is removed,
and the next time the same key comes in it is reallocated to the best
worker. This lets us rebalance the queues over time, so that customers
don't face longer wait times even though we process things serially.
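
Session removal and rebalancing can be sketched the same way. The
=SessionTable= class, its =assign= and =complete= methods, and the
queue names below are hypothetical and only model the behavior
described above: a session sticks to its queue while it has pending
jobs, disappears when the last job completes, and is placed afresh on
the least loaded queue the next time its key shows up.

#+begin_src ruby
# Hypothetical model of session lifetime; not the real scheduler's storage.
class SessionTable
  def initialize(queue_names)
    # queue name => { session key => number of pending jobs }
    @pending = queue_names.to_h { |name| [name, Hash.new(0)] }
  end

  # Record a new job for the key and return the queue it was placed on.
  def assign(key)
    queue = queue_for(key) || least_loaded
    @pending[queue][key] += 1
    queue
  end

  # Record a finished job; drop the session once nothing is pending.
  def complete(key)
    queue = queue_for(key)
    return if queue.nil?

    sessions = @pending[queue]
    sessions[key] -= 1
    sessions.delete(key) if sessions[key] <= 0
  end

  private

  def queue_for(key)
    @pending.each { |queue, sessions| return queue if sessions.key?(key) }
    nil
  end

  def least_loaded
    @pending.min_by { |_queue, sessions| sessions.values.sum }.first
  end
end

table  = SessionTable.new(%w[ordered_1 ordered_2])
before = table.assign("project-a")       # project-a gets a queue
2.times { table.assign("project-b") }    # project-b lands on the other queue
3.times { table.assign("project-c") }    # project-c joins project-a's queue
table.complete("project-a")              # project-a's session is removed
after = table.assign("project-a")        # re-placed on the now emptier queue

puts "before: #{before}, after: #{after}"
#+end_src
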