Cleaning up directory

2024-04-20 10:21:42 -04:00
parent d6afd9f472
commit b4f4565894
17 changed files with 0 additions and 273 deletions


@@ -0,0 +1,64 @@
#+TITLE: Ruby 3 Upgrades
#+AUTHOR: Adam Mohammed
#+DATE: May 10, 2023
* Agenda
- Recap: API deployment architecture
- Lessons from the Rails 6.0/6.1 upgrade
- Defining key performance indicators
* Recap: API Deployment
The API deployment consists of:
- *frontend pods* - 10 pods dedicated to serving HTTP traffic
- *worker pods* - 8 pods dedicated to job processing
- *cron jobs* - various rake tasks executed to perform periodic upkeep necessary for the API
** Release Candidate Deployment Strategy
This is a form of canary deployment: we divert just a small amount of
traffic to the new version while watching for an increased error
rate. After some time, we assess how the candidate has been
performing. If things look bad, we scale back and address the
issues. Otherwise, we ramp up the amount of traffic that the release
candidate pods see.
Doing things this way allows us to build confidence in the release,
but it does not come without drawbacks. The most important thing to be
aware of is that we're relying on the k8s service to load balance
between the two versions of the application. That means that we're not
doing any tricks to make sure that a customer is only ever hitting a
single app version.
We accept this risk because issues with HTTP requests are mostly
confined to the request itself, and each span stamps the Rails version
that processed that portion of the request.
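As a rough sketch of how that version stamp could be applied (this
assumes the API's OpenTelemetry instrumentation; the middleware and
the =app.rails_version= attribute name are illustrative, not the
actual code):
#+begin_src ruby
# Hypothetical Rack middleware: tag the active span with the Rails and
# Ruby versions so error rates can be broken down by version in Honeycomb.
require "opentelemetry/sdk"

class VersionStampMiddleware
  def initialize(app)
    @app = app
  end

  def call(env)
    span = OpenTelemetry::Trace.current_span
    span.set_attribute("app.rails_version", Rails.version)
    span.set_attribute("app.ruby_version", RUBY_VERSION)
    @app.call(env)
  end
end
#+end_src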
Some HTTP requests cannot be completed entirely within the
request/response cycle. For these endpoints, we queue up background
jobs that the workers eventually process. This means that some
requests will be processed by the release candidate while their
background jobs are processed by the older application version.
Because of this, when using this release strategy, we're assuming that
the two versions are compatible and can run side by side.
* Lessons from Previous Rails Upgrades
* Defining key performance indicators
Typically, what I would do (and what I assume Lucas does) is just keep an eye on Rollbar, which captures anything fundamentally broken that raises exceptions or errors in Rails. Additionally, I would keep a broad view of errors by span kind in Honeycomb to see whether we're seeing a spike associated with the release candidate.
- What we were looking at in the previous releases
- Error rates by span kind per version
This helps us know whether the error rate for requests is higher in one version or the other, or whether we're failing specifically in processing background jobs.
- No surprises in Rollbar
Instead of relying on the absence of surprises, ideally we'd be tracking a stable set of indicators that the system itself reports.


@@ -0,0 +1,23 @@
#+TITLE: Scalable API
#+AUTHOR: Adam Mohammed
* Overview
In this document we take a look at the concept of breaking the
monolith from the start. By that I mean: what do we hope to achieve
by breaking the monolith? From there we can identify the
problems we're trying to solve.
Part of the problem I have with the "breaking the monolith" phrase is
that the vision is too lofty. That phrase isn't a vision; it's snake
oil. The
promised land we hope to get to is a place where teams are able to
focus on delivering business value and new features to customers that
are meaningful, leverage our existing differentiators, and enable new
differentiators.
What do we currently believe is preventing us from delivering business
value quickly? What we identify there is a hypothesis based on some
level of intuition, so it's a great starting point for an attempt to optimize
the process. It's even better if we can quantify how much effort is
spent doing these speed-inhibiting activities, so we know we're
optimizing our bottlenecks.


@@ -0,0 +1,166 @@
#+TITLE: Session Scheduler
* Overview
For some API requests, the time it would take to serve the request is
too long for a typical HTTP call. We use ActiveJob from Rails to
handle these types of background jobs. Typically, instead of servicing
the whole request before responding back to the client, we'll just
create a new job and then immediately return.
Sometimes we have jobs that need to be processed in a specific order,
and this is where the session scheduler comes in. It manages a number
of queues for these workloads and assigns jobs to them dynamically.
This document talks about what kinds of problems the scheduler is meant
for, how it is implemented, and how you can use it.
* Ordering problems
Often there are ordering constraints between these background
jobs. In some networking APIs, for example, things must happen in a
specific order to achieve the desired state.
The simplest example of this is assigning and unassigning a VLAN on a
port. You can quickly make these calls to the API in succession, but
it may take some time for the actual state of the switch to be
updated. If these jobs are processed in parallel, the order in which
they finish determines the final state of the port.
If the unassign finishes first, then the final state the user will see
is that the port is assigned to the VLAN. Otherwise, it'll end up
with no VLAN assigned.
The best we can do here is make the assumption that we get the
requests in the order that the customer wanted operations to occur
in. So, if the assign came in first, we must finish that job before
processing the unassign.
Our API workers that serve the background jobs currently fetch and
process jobs as fast as they can, with no respect for ordering. When
ordering is not important, this approach processes jobs quickly.
With our networking example, though, it leads to behavior that's hard
to predict on the customer's end.
* Constraints
We have a few constraints for creating a solution to the ordering
problem. Using VLANs as an example:
- Order of jobs must be respected within a project, but total ordering
is not important (e.g. Project A's tasks don't need to be ordered
  with respect to Project B's tasks)
- Dynamically spinning up consumers and queues isn't the most fun thing
in Ruby, but having access to the monolith data is required at this
point in time.
- We need a way to map an arbitrary number of projects down to a fixed set of
consumers.
- Although total ordering doesn't matter, we do want to be somewhat
  fair.
Let's clarify some terms:
- Total Ordering - All events occur in a specific order (A1 -> B1 ->
A2 -> C1 -> B2 -> C2 -> B3)
- Partial Ordering - Some events must occur before others, but the
  combinations are free (e.g. A1 must occur before A2, which must occur
  before A3, but [A1, A2, A3] has no relation to B1).
- Correctness - Job ordering constraints are honored.
- Fairness - If there are jobs A1, A2, ..., An and jobs B1, B2, ..., Bn,
  both sets are able to get serviced in some reasonable amount of time.
* Session scheduler
** Queueing and Processing Jobs In Order
For some requests in the Metal API, we aren't able to fully service
the request within the span of an HTTP request/response cycle. Some things
might take several seconds to minutes to complete. We rely on Rails
ActiveJob to help us accomplish these things as background
jobs. ActiveJob lets us specify a queue name, which, until now, has
been a static name such as =network=.
The API runs a number of workers that are listening on these queues
with multiple threads, so we can pick up and service the jobs quickly.
This breaks down when we require some jobs to be processed serially or
in a specific order. This is where the =Worker::SessionScheduler=
comes in. This scheduler dynamically assigns the queue name for a job
so that it is processed in order with other related jobs.
A typical Rails job looks something like this:
#+begin_src ruby
class MyJob < ApplicationJob #1
  queue_as :network #2

  def perform #3
    # do stuff
  end
end
#+end_src
1. The name of the job is =MyJob=
2. The queue that the job will wait in before getting picked up
3. =perform= defines the work that the consumer picking up the job will do
Typically, we'll queue a job to be performed later within the span of
an HTTP request by doing something like =MyJob.perform_later=. This
puts the job on the =network= queue, and the next available worker
will pull the job off of the queue and then process it.
In the case where we need jobs to be processed in a certain order, it
might look like this:
#+begin_src ruby
class MyJob < ApplicationJob
  queue_as do #2
    project = self.arguments.first
    Worker::SessionScheduler.call(session_key: project.id)
  end

  def perform(project)
    # do stuff
  end
end
#+end_src
Now, instead of =2= being just a static queue name, the queue is
assigned dynamically by the scheduler.
The scheduler uses the "session key" to see if there are any other
jobs queued with the same key; if there are, the job is sent to the
same queue.
If there aren't, the job is sent to the queue with the fewest jobs
waiting to be processed, and any subsequent jobs with the same
"session key" will follow it there.
Just putting jobs in the same queue isn't enough, though, because if
we process the jobs from a queue in parallel, we can still end up with
jobs completing out of order. We have queues designated specifically
for processing things in order. We're currently leveraging a feature
on RabbitMQ queues that guarantees that only one consumer is ever
getting the jobs to process. We also rely on the configuration of that
consumer to use a single thread, to make sure we're not doing things
out of order.
This can be used for any set of jobs that need to be ordered, though
currently we're only using it for port VLAN management. If you decide
to use it, make sure that all related jobs share some attribute that
you can use as your "session key" when calling into the scheduling
service, as in the sketch below.
The scheduler takes care of the details of managing the queues: once
all the jobs for a session are completed, that session is removed,
and the next time the same key comes in it'll be reallocated to the
best worker. This allows us to rebalance the queues over time and
prevents customers from seeing longer wait times even though we're
doing things serially.


@@ -0,0 +1,181 @@
#+TITLE: LBaaS Testing
#+AUTHOR: Adam Mohammed
#+DATE: August 30, 2023
* API Testing
:PROPERTIES:
:header-args:shell: :session *bash3*
:header-args: :results output verbatim
:END:
#+begin_src shell
PS1="> "
export PAPI_KEY="my-user-api-key"
export PROJECT_ID=7c0d4b1d-4f21-4657-96d4-afe6236e361e
#+end_src
First let's exchange our user's API key for an infratographer JWT.
#+begin_src shell
export INFRA_TOK=$(curl -s -X POST -H"authorization: Bearer $PAPI_KEY" https://iam.metalctrl.io/api-keys/exchange | jq -M -r '.access_token' )
#+end_src
#+RESULTS:
If all went well, you should see a JSON object containing the =loadbalancers= key from this block.
#+begin_src shell
curl -s -H"Authorization: Bearer $INFRA_TOK" https://lb.metalctrl.io/v1/projects/${PROJECT_ID}/loadbalancers | jq -M
#+end_src
#+RESULTS:
#+begin_example
{
"loadbalancers": [
{
"created_at": "2023-08-30T18:26:19.534351Z",
"id": "loadbal-9OhCaBNHUXo_f-gC7YKzW",
"ips": [],
"name": "test-graphql",
"ports": [
{
"id": "loadprt-8fN2XRnwY8C0SGs_T-zhp",
"name": "public-http",
"number": 8080
}
],
"updated_at": "2023-08-30T18:26:19.534351Z"
},
{
"created_at": "2023-08-30T19:55:42.944273Z",
"id": "loadbal-pLdVJLcAa3UdbPEmGWwvB",
"ips": [],
"name": "test-graphql",
"ports": [
{
"id": "loadprt-N8xRozMbxZwtG2yAPk7Wx",
"name": "public-http",
"number": 8080
}
],
"updated_at": "2023-08-30T19:55:42.944273Z"
}
]
}
#+end_example
** Creating an LB
Here we'll create an empty LB with our newly exchanged token.
#+begin_src shell
curl -s \
-H"Authorization: Bearer $INFRA_TOK" \
-H"content-type: application/json" \
-d '{"name": "test-graphql", "location_id": "metlloc-da", "provider_id":"loadpvd-gOB_-byp5ebFo7A3LHv2B"}' \
https://lb.metalctrl.io/v1/projects/${PROJECT_ID}/loadbalancers | jq -M
#+end_src
#+RESULTS:
:
: > > > {
: "errors": null,
: "id": "loadbal-ygZi9cUywLk5_oAoLGMxh"
: }
All we have is an ID now, but eventually we should get an IP back.
#+begin_src shell
RES=$(curl -s \
-H"Authorization: Bearer $INFRA_TOK" \
https://lb.metalctrl.io/v1/projects/${PROJECT_ID}/loadbalancers | tee )
export LOADBALANCER_ID=$(echo $RES | jq -r '.loadbalancers | sort_by(.created_at) | reverse | .[0].id' )
echo $LOADBALANCER_ID
#+end_src
#+RESULTS:
:
: > > > loadbal-ygZi9cUywLk5_oAoLGMxh
** Create the backends
The load balancer requires a pool with an associated origin.
#+begin_src shell
export POOL_ID=$(curl -s -H"Authorization: Bearer $INFRA_TOK" \
-H"content-type: application/json" \
-d '{"name": "pool9", "protocol": "tcp"}' \
https://lb.metalctrl.io/v1/projects/${PROJECT_ID}/loadbalancers/pools | jq -r '.id')
echo $POOL_ID
#+end_src
#+RESULTS:
:
: > > > loadpol-hC_UY3Woqjfyfw1Tzr5R2
Let's create an origin in the pool that points to =icanhazip.com= so we can see how the proxying behaves.
#+begin_src shell
export TARGET_IP=$(dig +short icanhazip.com | head -1)
data=$(jq -M -c -n --arg port_id $POOL_ID --arg target_ip "$TARGET_IP" '{"name": "icanhazip9", "target": $target_ip, "port_id": $port_id, "port_number": 80, "active": true}' | tee )
curl -s \
-H"Authorization: Bearer $INFRA_TOK" \
-H"content-type: application/json" \
-d "$data" \
https://lb.metalctrl.io/v1/loadbalancers/pools/${POOL_ID}/origins | jq -M
#+end_src
#+RESULTS:
:
: > > > > > {
: "errors": null,
: "id": "loadogn-zfbMfqtFKeQ75Tul52h4Q"
: }
#+begin_src shell
curl -s \
-H"Authorization: Bearer $INFRA_TOK" \
-H"content-type: application/json" \
-d "$(jq -n -M -c -n --arg pool_id $POOL_ID '{"name": "public-http", "number": 8080, "pool_ids": [$pool_id]}')" \
https://lb.metalctrl.io/v1/loadbalancers/${LOADBALANCER_ID}/ports | jq -M
#+end_src
#+RESULTS:
:
: > > > {
: "errors": null,
: "id": "loadprt-IVrZB1sLUfKqdnDULd6Ix"
: }
** Let's try out the LB now
#+begin_src shell
curl -s \
-H"Authorization: Bearer $INFRA_TOK" \
-H"content-type: application/json" \
https://lb.metalctrl.io/v1/loadbalancers/${LOADBALANCER_ID} | jq -M
#+end_src
#+RESULTS:
#+begin_example
> > {
"created_at": "2023-08-30T20:10:59.389392Z",
"id": "loadbal-ygZi9cUywLk5_oAoLGMxh",
"ips": [],
"name": "test-graphql",
"ports": [
{
"id": "loadprt-IVrZB1sLUfKqdnDULd6Ix",
"name": "public-http",
"number": 8080
}
],
"provider": null,
"updated_at": "2023-08-30T20:10:59.389392Z"
}
#+end_example


@@ -0,0 +1,58 @@
#+TITLE: Year in review
#+AUTHOR: Adam Mohammed
* January
- Setting up environments for platform to test auth0 changes against portal
- Created a Go library to make it easier to build Algolia indexes
  in our applications. Used by bouncer and quantum to provide nice searchable
  interfaces on our frontends.
- Implemented the initial OIDC endpoints for identity-api in LBaaS
* February
- Wrote Helm charts for identity-api
- Bootstrapped initial identity-api deployment
- Discussed token format for identity-api
- Adding algolia indexing to quantum resources
* March
- Drafted plan for upgrading the monolith from Rails 5 to Rails 6 and Ruby 2 to Ruby 3.
- Implemented extra o11y where we needed for the upgrade
- Used gradual rollout strategy to build confidence
- Upgraded CRDB and documented the process
* April
- Added testing to exoskeleton - some gin tooling we use for go services
* May
- Started work on the ResourceOwnerDirectory
- Maintenance on exoskeleton
* June
- More ROD work
- Ruby 3 upgrade
- Added service to service clients for coupon
- Testing LBaaS with decuddle
- Added events to the API
* July
- Deploy Resource Owner Directory
* August
- Get ready for LBaaS Launch
* September
- Implemented queue scheduler
* Talks:
- Session Scheduler
- Static analysis on Ruby
- API Auth discussion on using identity-api
- API monitoring by thinking about what we actually deliver
- Deep diving caching issues from #_incent-1564
- Recorded deployment and monitoring of API
- Monitoring strategy for the API Rails/Ruby Upgrades
- CRDB performance troubleshooting
* Docs:


@@ -0,0 +1,138 @@
* Goal: Expand our Market - Lay the foundation for product-led growth
On the Nautilus team the biggest responsibility we have is the monolith, and as we've added people to the team, we've started moving new logic into services outside of the monolith. In order to make this simple and reduce maintenance burden, I've created exoskeleton and algolyzer, which are Go libraries that we can use to develop Go services a bit more quickly.
Exoskeleton provides a type-safe routing layer built on top of Gin and bakes in OTEL, so it's easy for us to take our services from local development to production-ready.
Algolyzer makes it easier to update Algolia indexes outside of the request span, keeping latency low while still making sure our UIs can easily be searched for relevant objects.
Additionally, I have made a number of improvements to our core infrastructure:
- Improving monitoring of our application to make major upgrades less scary
- Upgrading from Rails 5 to Rails 6
- Upgrading from Ruby 2 to Ruby 3
- Deploying and performing regular maintenance on our CockroachDB cluster
- Diagnosing anycast routing issues with our CRDB deployment that led to unexpectedly high latency, which resulted in changing the network from equal-path routing to prefer-local.
With these changes we're able to keep the lights on while still experimenting cheaply with the common infra needed for smaller services.
* Goal: Build the foundation - A market-leading end-to-end user experience
As we started to deliver LBaaS, Infratographer had an entirely
different opinion on how to manage users and resource ownership, and I
created a GraphQL service to bridge the gap between infratographer
concepts and metal concepts, so when a customer uses the product,
it'll seem familiar. The metal API also emits events that can be
subscribed to over NATS to get updates for things such as organization
and project membership changes.
In order to accomplish this it meant close collaboration with the
identity team to help establish the interfaces and decide on who is
responsible for what parts. Load balancers can now be provisioned and
act as if they belong to a project, even though the system of record
lies completely outside of the Metal API.
VMC-E exposed ordering issues in the VLAN assignment portion of our
networking stack. I worked with my teammates and SWNet
to improve the situation. I designed and implemented a queuing
solution that allows us to queue asynchronous tasks that are order
dependent on queues with a single consumer. We've already gotten
feedback from VMC-E and other customers that the correctness issues
with VLAN assignment have been solved, and we don't need to wait for a
complete networking overhaul from Orca to fix it. This solution gives
us more opportunities to target other parts of our networking stack
that suffer from ordering issues.
For federated SSO, I was able to help keep communication between
Platform Identity, Nautilus and Portals flowing smoothly by
documenting exactly what was needed to get us in a position to onboard
our first set of customers using SSO. I used my knowledge of OAuth2 and
OpenID Connect to break down the integration points in a document
shared between these teams so it was clear what we needed to do. This
made it easier to commit and deliver within the timeframe we set.
not networking specific
nano metal
audit logging
* Goal: DS FunctionalPriorities - Build, socialize, and execute on plan to improve engineering experience
Throughout this year, I've been circulating ideas in writing and in
shared forums more often. Within the Nautilus team I did 8 tech talks
to share ideas and information with the team and to solicit
feedback. I also wrote documents for collaborating with other teams
mainly for LBaaS (specifically around how it integrates with the
EMAPI) and federated SSO.
- CRDB performance troubleshooting
I discussed how I determined that anycast routing was not properly
weighted, and my methodology for designing tests to diagnose the issue.
- Monitoring strategy for the API Rails/Ruby Upgrades
  Here I discussed how we intended to do these upgrades in a way that
  builds confidence on top of what our test suites already give us,
  by measuring indicators of performance.
- Recorded deployment and monitoring of API
As we added more people to the team, recording this just made it
easier to have something we could point to for an API deployment. We
also have this process documented in the repo.
- Deep diving caching issues from #_incent-1564
  We ran into a very hard-to-reproduce error where different users
  accessing the same organization were returned the same list of
  organizations/projects regardless of their access. Although the API
  prevented actual reads of the objects that a user didn't have
  proper access to, serving the wrong set of IDs produced unexpected
  behavior in the Portal. It took a long time to diagnose this, and
then I discussed the results with the team.
- API monitoring by thinking about what we actually deliver
  Related to the Rails upgrades: accurately measuring the health of
  the monolith requires periodically re-evaluating whether we're
  measuring what matters.
- API Auth discussion on using identity-api
Discussion on the potential uses for identity-api in a
service-to-service context that the API uses quite frequently as we
build functionality outside of the API.
- Static analysis on Ruby
With a dynamically typed language, runtime exceptions are no fun,
but some static analysis goes a long way. In this talk I explained
how it works at the AST level and how we can use this to enforce
conventions that we have adopted in the API. As an action item, I
started enabling useful "cops" to prevent common logic errors in
  Ruby.
- Session Scheduler
Here I discussed the problem and the solution that we implemented
  to prevent VLANs from ending up in inconsistent states when assigned and
  unassigned in quick succession. The solution we delivered was generic and
  solved the problem simply, and this talk was meant to shine some light on
  the new tool the team now has for ordering problems.
* Twilio account
always assisting the team
help new joinees to ramp up fast
participate in interviews
easy to work with across teams
clear communication
able to navigate
relations with delivery
not only engineering - product, devrel