Cleaning up directory

2024-04-20 10:21:42 -04:00
parent d6afd9f472
commit b4f4565894
17 changed files with 0 additions and 273 deletions


@@ -0,0 +1,64 @@
#+TITLE: Ruby 3 Upgrades
#+AUTHOR: Adam Mohammed
#+DATE: May 10, 2023
* Agenda
- Recap: API deployment architecture
- Lessons from the Rails 6.0/6.1 upgrade
- Defining key performance indicators
* Recap: API Deployment
The API deployment consists of:
- *frontend pods* - 10 pods dedicated to serving HTTP traffic
- *worker pods* - 8 pods dedicated to job processing
- *cron jobs* - various rake tasks executed to perform periodic upkeep necessary for the API
** Release Candidate Deployment Strategy
This is a form of canary deployment: we divert just a small amount of
traffic to the new version while watching for an increased error
rate. After some time, we assess how the candidate has been
performing. If things look bad, we scale back and address the
issues. Otherwise, we ramp up the amount of traffic that the release
candidate pods see.
Doing things this way allows us to build confidence in the release,
but it does not come without drawbacks. The most important thing to be
aware of is that we're relying on the k8s service to load balance
between the two versions of the application. That means that we're not
doing any tricks to make sure that a customer is only ever hitting a
single app version.
We accept this risk because issues with HTTP requests are mostly
confined to the request itself, and each span stamps the Rails version
that processed that portion of the request.
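As a rough sketch of how that version stamp could be applied (this
assumes the API's OpenTelemetry instrumentation; the middleware and
the =app.rails_version= attribute name are illustrative, not the
actual code):
#+begin_src ruby
# Hypothetical Rack middleware: tag the active span with the Rails and
# Ruby versions so error rates can be broken down by version in Honeycomb.
require "opentelemetry/sdk"

class VersionStampMiddleware
  def initialize(app)
    @app = app
  end

  def call(env)
    span = OpenTelemetry::Trace.current_span
    span.set_attribute("app.rails_version", Rails.version)
    span.set_attribute("app.ruby_version", RUBY_VERSION)
    @app.call(env)
  end
end
#+end_src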
Some HTTP requests cannot be completed entirely within the
request/response cycle. For these endpoints, we queue up background
jobs that the workers eventually process. This means that some
requests will be processed by the release candidate while their
background jobs are processed by the older application version.
Because of this, when using this release strategy, we're assuming that
the two versions are compatible and can run side by side.
* Lessons from Previous Rails Upgrades
* Defining key performance indicators
Typically, what I would do (and what I assume Lucas does) is just keep an eye on Rollbar, which captures anything fundamentally broken that raises exceptions or errors in Rails. Additionally, I would keep a broad view of errors by span kind in Honeycomb to see whether we're seeing a spike associated with the release candidate.
- What we were looking at in the previous releases
- Error rates by span kind per version
This helps us know whether the error rate for requests is higher in one version or the other, or whether we're failing specifically in processing background jobs.
- No surprises in Rollbar
Instead of relying on the absence of surprises, ideally we'd be tracking a stable set of indicators that the system itself reports.


@@ -0,0 +1,23 @@
#+TITLE: Scalable API
#+AUTHOR: Adam Mohammed
* Overview
In this document we take a look at the concept of breaking the
monolith from the start. By that I mean: what do we hope to achieve
by breaking the monolith? From there we can identify the
problems we're trying to solve.
Part of the problem I have with the "breaking the monolith" phrase is
that the vision is too lofty. That phrase isn't a vision; it's snake
oil. The
promised land we hope to get to is a place where teams are able to
focus on delivering business value and new features to customers that
are meaningful, leverage our existing differentiators, and enable new
differentiators.
What do we currently believe is preventing us from delivering business
value quickly? What we identify there is a hypothesis based on some
level of intuition, so it's a great starting point for an attempt to optimize
the process. It's even better if we can quantify how much effort is
spent doing these speed-inhibiting activities, so we know we're
optimizing our bottlenecks.


@@ -0,0 +1,166 @@
#+TITLE: Session Scheduler
* Overview
For some API requests, the time it would take to serve the request is
too long for a typical HTTP call. We use ActiveJob from Rails to
handle these types of background jobs. Typically, instead of servicing
the whole request before responding back to the client, we'll just
create a new job and then immediately return.
Sometimes we have jobs that need to be processed in a specific order,
and this is where the session scheduler comes in. It manages a number
of queues for these workloads and assigns jobs to them dynamically.
This document talks about what kinds of problems the scheduler is meant
for, how it is implemented, and how you can use it.
* Ordering problems
Often there are ordering constraints between these background
jobs. In some networking APIs, for example, things must happen in a
specific order to achieve the desired state.
The simplest example of this is assigning and unassigning a VLAN on a
port. You can quickly make these calls to the API in succession, but
it may take some time for the actual state of the switch to be
updated. If these jobs are processed in parallel, the order in which
they finish determines the final state of the port.
If the unassign finishes first, then the final state the user will see
is that the port is assigned to the VLAN. Otherwise, it'll end up
with no VLAN assigned.
The best we can do here is make the assumption that we get the
requests in the order that the customer wanted operations to occur
in. So, if the assign came in first, we must finish that job before
processing the unassign.
Our API workers that serve the background jobs currently fetch and
process jobs as fast as they can, with no respect for ordering. When
ordering is not important, this approach processes jobs quickly.
With our networking example, though, it leads to behavior that's hard
to predict on the customer's end.
* Constraints
We have a few constraints for creating a solution to the ordering
problem. Using VLANs as an example:
- Order of jobs must be respected within a project, but total ordering
is not important (e.g. Project A's tasks don't need to be ordered
  with respect to Project B's tasks)
- Dynamically spinning up consumers and queues isn't the most fun thing
in Ruby, but having access to the monolith data is required at this
point in time.
- We need a way to map an arbitrary number of projects down to a fixed set of
consumers.
- Although total ordering doesn't matter, we do want to be somewhat
  fair.
Let's clarify some terms:
- Total Ordering - All events occur in a specific order (A1 -> B1 ->
A2 -> C1 -> B2 -> C2 -> B3)
- Partial Ordering - Some events must occur before others, but the
  combinations are free (e.g. A1 must occur before A2, which must occur
  before A3, but [A1, A2, A3] has no relation to B1).
- Correctness - Job ordering constraints are honored.
- Fairness - If there are jobs A1, A2, ..., An and jobs B1, B2, ..., Bn,
  both sets are able to get serviced in some reasonable amount of time.
* Session scheduler
** Queueing and Processing Jobs In Order
For some requests in the Metal API, we aren't able to fully service
the request within the span of an HTTP request/response cycle. Some things
might take several seconds to minutes to complete. We rely on Rails
ActiveJob to help us accomplish these things as background
jobs. ActiveJob lets us specify a queue name, which, until now, has
been a static name such as =network=.
The API runs a number of workers that are listening on these queues
with multiple threads, so we can pick up and service the jobs quickly.
This breaks down when we require some jobs to be processed serially or
in a specific order. This is where the =Worker::SessionScheduler=
comes in. This scheduler dynamically assigns the queue name for a job
so that it is processed in order with other related jobs.
A typical Rails job looks something like this:
#+begin_src ruby
class MyJob < ApplicationJob #1
  queue_as :network #2

  def perform #3
    # do stuff
  end
end
#+end_src
1. The name of the job is =MyJob=
2. The queue that the job will wait in before getting picked up
3. =perform= defines the work that the consumer picking up the job will do
Typically, we'll queue a job to be performed later within the span of
an HTTP request by doing something like =MyJob.perform_later=. This
puts the job on the =network= queue, and the next available worker
will pull the job off of the queue and then process it.
In the case where we need jobs to be processed in a certain order, it
might look like this:
#+begin_src ruby
class MyJob < ApplicationJob
  queue_as do #2
    project = self.arguments.first
    Worker::SessionScheduler.call(session_key: project.id)
  end

  def perform(project)
    # do stuff
  end
end
#+end_src
Now, instead of =2= being just a static queue name, the queue is
assigned dynamically by the scheduler.
The scheduler uses the "session key" to see if there are any other
jobs queued with the same key; if there are, the job is sent to the
same queue.
If there aren't, the job is sent to the queue with the fewest jobs
waiting to be processed, and any subsequent jobs with the same
"session key" will follow it there.
Just putting jobs in the same queue isn't enough, though, because if
we process the jobs from a queue in parallel, we can still end up with
jobs completing out of order. We have queues designated specifically
for processing things in order. We're currently leveraging a feature
on RabbitMQ queues that guarantees that only one consumer is ever
getting the jobs to process. We also rely on the configuration of that
consumer to use a single thread, to make sure we're not doing things
out of order.
This can be used for any set of jobs that need to be ordered, though
currently we're only using it for port VLAN management. If you decide
to use it, make sure that all related jobs share some attribute that
you can use as your "session key" when calling into the scheduling
service, as in the sketch below.
The scheduler takes care of the details of managing the queues: once
all the jobs for a session are completed, that session is removed,
and the next time the same key comes in it'll be reallocated to the
best worker. This allows us to rebalance the queues over time and
prevents customers from seeing longer wait times even though we're
doing things serially.


@@ -0,0 +1,181 @@
#+TITLE: LBaaS Testing
#+AUTHOR: Adam Mohammed
#+DATE: August 30, 2023
* API Testing
:PROPERTIES:
:header-args:shell: :session *bash3*
:header-args: :results output verbatim
:END:
#+begin_src shell
PS1="> "
export PAPI_KEY="my-user-api-key"
export PROJECT_ID=7c0d4b1d-4f21-4657-96d4-afe6236e361e
#+end_src
First let's exchange our user's API key for an infratographer JWT.
#+begin_src shell
export INFRA_TOK=$(curl -s -X POST -H"authorization: Bearer $PAPI_KEY" https://iam.metalctrl.io/api-keys/exchange | jq -M -r '.access_token' )
#+end_src
#+RESULTS:
If all went well, you should see a JSON object containing the =loadbalancers= key from this block.
#+begin_src shell
curl -s -H"Authorization: Bearer $INFRA_TOK" https://lb.metalctrl.io/v1/projects/${PROJECT_ID}/loadbalancers | jq -M
#+end_src
#+RESULTS:
#+begin_example
{
"loadbalancers": [
{
"created_at": "2023-08-30T18:26:19.534351Z",
"id": "loadbal-9OhCaBNHUXo_f-gC7YKzW",
"ips": [],
"name": "test-graphql",
"ports": [
{
"id": "loadprt-8fN2XRnwY8C0SGs_T-zhp",
"name": "public-http",
"number": 8080
}
],
"updated_at": "2023-08-30T18:26:19.534351Z"
},
{
"created_at": "2023-08-30T19:55:42.944273Z",
"id": "loadbal-pLdVJLcAa3UdbPEmGWwvB",
"ips": [],
"name": "test-graphql",
"ports": [
{
"id": "loadprt-N8xRozMbxZwtG2yAPk7Wx",
"name": "public-http",
"number": 8080
}
],
"updated_at": "2023-08-30T19:55:42.944273Z"
}
]
}
#+end_example
** Creating an LB
Here we'll create an empty LB with our newly exchanged token.
#+begin_src shell
curl -s \
-H"Authorization: Bearer $INFRA_TOK" \
-H"content-type: application/json" \
-d '{"name": "test-graphql", "location_id": "metlloc-da", "provider_id":"loadpvd-gOB_-byp5ebFo7A3LHv2B"}' \
https://lb.metalctrl.io/v1/projects/${PROJECT_ID}/loadbalancers | jq -M
#+end_src
#+RESULTS:
:
: > > > {
: "errors": null,
: "id": "loadbal-ygZi9cUywLk5_oAoLGMxh"
: }
All we have is an ID now, but eventually we should get an IP back.
#+begin_src shell
RES=$(curl -s \
-H"Authorization: Bearer $INFRA_TOK" \
https://lb.metalctrl.io/v1/projects/${PROJECT_ID}/loadbalancers | tee )
export LOADBALANCER_ID=$(echo $RES | jq -r '.loadbalancers | sort_by(.created_at) | reverse | .[0].id' )
echo $LOADBALANCER_ID
#+end_src
#+RESULTS:
:
: > > > loadbal-ygZi9cUywLk5_oAoLGMxh
** Create the backends
The load balancer requires a pool with an associated origin.
#+begin_src shell
export POOL_ID=$(curl -s -H"Authorization: Bearer $INFRA_TOK" \
-H"content-type: application/json" \
-d '{"name": "pool9", "protocol": "tcp"}' \
https://lb.metalctrl.io/v1/projects/${PROJECT_ID}/loadbalancers/pools | jq -r '.id')
echo $POOL_ID
#+end_src
#+RESULTS:
:
: > > > loadpol-hC_UY3Woqjfyfw1Tzr5R2
Let's create an origin in the pool that points to =icanhazip.com= so we can see how the proxying behaves.
#+begin_src shell
export TARGET_IP=$(dig +short icanhazip.com | head -1)
data=$(jq -M -c -n --arg port_id $POOL_ID --arg target_ip "$TARGET_IP" '{"name": "icanhazip9", "target": $target_ip, "port_id": $port_id, "port_number": 80, "active": true}' | tee )
curl -s \
-H"Authorization: Bearer $INFRA_TOK" \
-H"content-type: application/json" \
-d "$data" \
https://lb.metalctrl.io/v1/loadbalancers/pools/${POOL_ID}/origins | jq -M
#+end_src
#+RESULTS:
:
: > > > > > {
: "errors": null,
: "id": "loadogn-zfbMfqtFKeQ75Tul52h4Q"
: }
#+begin_src shell
curl -s \
-H"Authorization: Bearer $INFRA_TOK" \
-H"content-type: application/json" \
-d "$(jq -n -M -c -n --arg pool_id $POOL_ID '{"name": "public-http", "number": 8080, "pool_ids": [$pool_id]}')" \
https://lb.metalctrl.io/v1/loadbalancers/${LOADBALANCER_ID}/ports | jq -M
#+end_src
#+RESULTS:
:
: > > > {
: "errors": null,
: "id": "loadprt-IVrZB1sLUfKqdnDULd6Ix"
: }
** Let's try out the LB now
#+begin_src shell
curl -s \
-H"Authorization: Bearer $INFRA_TOK" \
-H"content-type: application/json" \
https://lb.metalctrl.io/v1/loadbalancers/${LOADBALANCER_ID} | jq -M
#+end_src
#+RESULTS:
#+begin_example
> > {
"created_at": "2023-08-30T20:10:59.389392Z",
"id": "loadbal-ygZi9cUywLk5_oAoLGMxh",
"ips": [],
"name": "test-graphql",
"ports": [
{
"id": "loadprt-IVrZB1sLUfKqdnDULd6Ix",
"name": "public-http",
"number": 8080
}
],
"provider": null,
"updated_at": "2023-08-30T20:10:59.389392Z"
}
#+end_example


@@ -0,0 +1,58 @@
#+TITLE: Year in review
#+AUTHOR: Adam Mohammed
* January
- Setting up environments for platform to test auth0 changes against portal
- Created a Go library to make it easier to build Algolia indexes
  in our applications. Used by bouncer and quantum to provide nice searchable
  interfaces on our frontends.
- Implemented the initial OIDC endpoints for identity-api in LBaaS
* February
- Wrote Helm charts for identity-api
- Bootstrapped initial identity-api deployment
- Discussed token format for identity-api
- Adding algolia indexing to quantum resources
* March
- Drafted plan for upgrading the monolith from Rails 5 to Rails 6 and Ruby 2 to Ruby 3.
- Implemented extra o11y where we needed for the upgrade
- Used gradual rollout strategy to build confidence
- Upgraded CRDB and documented the process
* April
- Added testing to exoskeleton - some gin tooling we use for go services
* May
- Started work on the ResourceOwnerDirectory
- Maintenance on exoskeleton
* June
- More ROD work
- Ruby 3 upgrade
- Added service to service clients for coupon
- Testing LBaaS with decuddle
- Added events to the API
* July
- Deploy Resource Owner Directory
* August
- Get ready for LBaaS Launch
* September
- Implemented queue scheduler
* Talks:
- Session Scheduler
- Static analysis on Ruby
- API Auth discussion on using identity-api
- API monitoring by thinking about what we actually deliver
- Deep diving caching issues from #_incent-1564
- Recorded deployment and monitoring of API
- Monitoring strategy for the API Rails/Ruby Upgrades
- CRDB performance troubleshooting
* Docs:


@@ -0,0 +1,138 @@
* Goal: Expand our Market - Lay the foundation for product-led growth
On the Nautilus team the biggest responsibility we have is the monolith, and as we've added people to the team, we've started moving new logic into services outside of the monolith. In order to make this simple and reduce maintenance burden, I've created exoskeleton and algolyzer, which are Go libraries that we can use to develop Go services a bit more quickly.
Exoskeleton provides a type-safe routing layer built on top of Gin and bakes in OTEL, so it's easy for us to take our services from local development to production-ready.
Algolyzer makes it easier to update Algolia indexes outside of the request span, keeping latency low while still making sure our UIs can easily be searched for relevant objects.
Additionally, I have made a number of improvements to our core infrastructure:
- Improving monitoring of our application to make major upgrades less scary
- Upgrading from Rails 5 to Rails 6
- Upgrading from Ruby 2 to Ruby 3
- Deploying and performing regular maintenance on our CockroachDB cluster
- Diagnosing anycast routing issues with our CRDB deployment that led to unexpectedly high latency, which resulted in changing the network from equal-path routing to prefer-local.
With these changes we're able to keep the lights on while still experimenting cheaply with the common infra needed for smaller services.
* Goal: Build the foundation - A market-leading end-to-end user experience
As we started to deliver LBaaS, Infratographer had an entirely
different opinion on how to manage users and resource ownership, and I
created a GraphQL service to bridge the gap between infratographer
concepts and metal concepts, so when a customer uses the product,
it'll seem familiar. The metal API also emits events that can be
subscribed to over NATS to get updates for things such as organization
and project membership changes.
In order to accomplish this it meant close collaboration with the
identity team to help establish the interfaces and decide on who is
responsible for what parts. Load balancers can now be provisioned and
act as if they belong to a project, even though the system of record
lies completely outside of the Metal API.
VMC-E exposed ordering issues in the VLAN assignment portion of our
networking stack. I worked with my teammates and SWNet
to improve the situation. I designed and implemented a queuing
solution that allows us to queue asynchronous tasks that are order
dependent on queues with a single consumer. We've already gotten
feedback from VMC-E and other customers that the correctness issues
with VLAN assignment have been solved, and we don't need to wait for a
complete networking overhaul from Orca to fix it. This solution gives
us more opportunities to target other parts of our networking stack
that suffer from ordering issues.
For federated SSO, I was able to help keep communication between
Platform Identity, Nautilus and Portals flowing smoothly by
documenting exactly what was needed to get us in a position to onboard
our first set of customers using SSO. I used my knowledge of OAuth2 and
OpenID Connect to break down the integration points in a document
shared between these teams so it was clear what we needed to do. This
made it easier to commit and deliver within the timeframe we set.
not networking specific
nano metal
audit logging
* Goal: DS FunctionalPriorities - Build, socialize, and execute on plan to improve engineering experience
Throughout this year, I've been circulating ideas in writing and in
shared forums more often. Within the Nautilus team I did 8 tech talks
to share ideas and information with the team and to solicit
feedback. I also wrote documents for collaborating with other teams
mainly for LBaaS (specifically around how it integrates with the
EMAPI) and federated SSO.
- CRDB performance troubleshooting
I discussed how I determined that anycast routing was not properly
weighted, and my methodology for designing tests to diagnose the issue.
- Monitoring strategy for the API Rails/Ruby Upgrades
  Here I discussed how we intended to do these upgrades in a way that
  builds confidence on top of what our test suites already give us,
  by measuring indicators of performance.
- Recorded deployment and monitoring of API
As we added more people to the team, recording this just made it
easier to have something we could point to for an API deployment. We
also have this process documented in the repo.
- Deep diving caching issues from #_incent-1564
  We ran into a very hard-to-reproduce error where different users
  accessing the same organization were returned the same list of
  organizations/projects regardless of their access. Although the API
  prevented actual reads of the objects that a user didn't have
  proper access to, serving the wrong set of IDs produced unexpected
  behavior in the Portal. It took a long time to diagnose this, and
then I discussed the results with the team.
- API monitoring by thinking about what we actually deliver
  Related to the Rails upgrades: accurately measuring the health of
  the monolith requires periodically re-evaluating whether we're
  measuring what matters.
- API Auth discussion on using identity-api
Discussion on the potential uses for identity-api in a
service-to-service context that the API uses quite frequently as we
build functionality outside of the API.
- Static analysis on Ruby
With a dynamically typed language, runtime exceptions are no fun,
but some static analysis goes a long way. In this talk I explained
how it works at the AST level and how we can use this to enforce
conventions that we have adopted in the API. As an action item, I
started enabling useful "cops" to prevent common logic errors in
  Ruby.
- Session Scheduler
Here I discussed the problem and the solution that we implemented
  to prevent VLANs from ending up in inconsistent states when assigned and
  unassigned in quick succession. The solution we delivered was generic and
  solved the problem simply, and this talk was meant to shine some light on
  the new tool the team now has for ordering problems.
* Twilio account
always assisting the team
help new joinees to ramp up fast
participate in interviews
easy to work with across teams
clear communication
able to navigate
relations with delivery
not only engineering - product, devrel