Adam Mohammed
2023-10-30 15:30:26 -04:00
parent 6f8d6220fa
commit 12cf3967ee
5 changed files with 268 additions and 0 deletions

design/vrf-bgp-routes.org Normal file

@@ -0,0 +1,17 @@
#+TITLE: Implementing endpoint to expose learned routes
#+AUTHOR: Adam Mohammed
#+DATE: October 30, 2023
I asked Tim what the difference is between
https://deploy.equinix.com/developers/api/metal#tag/VRFs/operation/getBgpDynamicNeighbors
and what's available in Trix.
The first endpoint only returns data the customer configured manually, while
Trix exposes the routes actually learned from peers.
I then asked if it made sense to expose the data as part of:
https://deploy.equinix.com/developers/api/metal#tag/VRFs/operation/getVrfRoutes
and the answer I got was probably.
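If we do fold learned routes into getVrfRoutes, a rough Go sketch of how
configured and learned routes could share one payload is below. The field
names (=origin=, =learned_from=) are hypothetical, not the actual Metal API
schema.
#+BEGIN_SRC go
// Purely illustrative: these field names are hypothetical, not the real
// Metal API schema. The idea is that configured and learned routes could
// share one payload, distinguished by an origin marker.
package main

import (
	"encoding/json"
	"fmt"
)

type VrfRoute struct {
	ID          string `json:"id"`
	Prefix      string `json:"prefix"`
	NextHop     string `json:"next_hop"`
	Origin      string `json:"origin"`                 // e.g. "static" (configured) or "bgp" (learned via Trix)
	LearnedFrom string `json:"learned_from,omitempty"` // peer address, set only for learned routes
}

func main() {
	routes := []VrfRoute{
		{ID: "r1", Prefix: "10.0.0.0/24", NextHop: "192.0.2.1", Origin: "static"},
		{ID: "r2", Prefix: "10.1.0.0/24", NextHop: "192.0.2.7", Origin: "bgp", LearnedFrom: "192.0.2.7"},
	}
	out, _ := json.MarshalIndent(routes, "", "  ")
	fmt.Println(string(out))
}
#+END_SRC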


@@ -0,0 +1,47 @@
#+TITLE: K8s concepts review
#+AUTHOR: Adam Mohammed
#+DATE: September 18, 2023
At one of the meetings I brought up how similar nano-metal felt to a
collection of K8s specifications that make standing up and managing
K8s clusters easier. In this document I'll cover the following topics
at a high level: Cluster API, CNI, CCM, and CSI.
First is the Cluster API, which came about as a means for creating and
managing Kubernetes clusters using Kubernetes itself. The Cluster API
allows an operator to use a so-called "management cluster" to create
other K8s clusters known as "workload clusters." The Cluster API is
NOT part of the core K8s resources, but is implemented as a set of
custom resource definitions and controllers that actually carry out
the desired actions.
A cluster operator can use the cluster-api to create workload clusters
by relying on three components: the bootstrap provider, the
infrastructure provider, and the control plane provider. Nanometal
aims to make provisioning of bare metal machines extensible and
scalable by enabling facilities to carry out the operations requested
by the EMAPI. We can think of the EMAPI as the "management cluster" in
this world.
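To make the three provider roles concrete, here is a minimal Go sketch. The
interface and type names are mine for illustration only; they are not actual
cluster-api types.
#+BEGIN_SRC go
// Hypothetical sketch: these interfaces are my own shorthand for the three
// cluster-api provider roles; they are not real cluster-api types.
package provider

import "context"

// InfrastructureProvider turns a machine request into actual compute,
// network, and storage (roughly what Metal does today).
type InfrastructureProvider interface {
	CreateMachine(ctx context.Context, spec MachineSpec) (MachineID, error)
	DeleteMachine(ctx context.Context, id MachineID) error
}

// BootstrapProvider turns a server provisioned with a base OS into a
// node that can join a cluster.
type BootstrapProvider interface {
	BootstrapData(ctx context.Context, id MachineID) ([]byte, error)
}

// ControlPlaneProvider manages the lifecycle of the cluster's control plane.
type ControlPlaneProvider interface {
	EnsureControlPlane(ctx context.Context, cluster string, replicas int) error
}

// MachineSpec is a stand-in for whatever configuration a request carries.
type MachineSpec struct {
	Plan  string // hardware plan / size
	Metro string // facility or metro where the machine should land
}

type MachineID string
#+END_SRC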
What Metal has today maps well to the infrastructure provider, since
all the cluster-api has to do is ask for machines with a certain
configuration and the provider is responsible for making that
happen. I think the bulk of the work for this project is figuring out
how we build an infrastructure provider out of our existing
components, but let's set that aside for now and consider the rest of
the components.
The bootstrap and control plane providers are concepts that also seem
important to our goal. We want it to be simple for us to enter a new
facility and set up the components we need to start provisioning
hardware. The bootstrap provider, in cluster-api terms, turns a server
provisioned with a base OS into an operating K8s node. For us, we
would probably also want some process that turns any facility or
existing datacenter into an Equinix Metal managed facility.
Once we know about the facility we need to manage, the concept of the
control plane provider maps well to the diagrams from Nanometal so
far. We'd want some component that installs the required agent and
supporting components in the facility so we can start providing Metal
services there.


@@ -0,0 +1,23 @@
* People
- Emi
- Hogle
- Lucas
- Laurence
- Ian
- Thiago
- Sahil
- Abhishek
- Navaneeth
- Surbhit
- Nikita
- Shelby
** Remaining work:
*** Integration testing with server service - Laurence
*** Review nanometal doc with provisioning team - Ian
*** Update automation of oapi-codegen and sqlboiler - Abhishek
*** Serialized Scheduler garbage collection - Hogle
** Sprint 19

vrf-exploration.org Normal file

@@ -0,0 +1,43 @@
#+TITLE: VRF exploration
Dangerous crew: Jarrod, Ian, Tim
- Trix is an internal tool; we're going to use Trix as a data source
- Its primary domain is to collect and expose raw telemetry data internally
- Trix itself is a collection tool (it is not aware of customers)
- Relationship data is in the monolith, but also in the Tenant API
- VRF GA zoom should take a look at that
- Looking glass questions we want to answer:
  - What does my routing table look like?
  - Where did it learn those routes from?
  - Is BGP up on this connection?
  - What does the status look like? What does the neighbor status look like?
  - What is the history of routes learned?
- This data is normally not well structured
- "shut them up"
- Show learned routes, as GCP does
- Proposed endpoint: vrf/:id/routes, using the VNI from the VRF (see the
  sketch below)
- CCIE
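Below is a rough Go sketch of what that endpoint could look like. It is only
an illustration: the =TrixClient= interface and the in-memory VRF store are
stand-ins for however we actually reach Trix and the monolith.
#+BEGIN_SRC go
// Hypothetical sketch of the proposed endpoint: look up the VRF, take its
// VNI, and ask Trix for the routes learned on it. TrixClient and the
// in-memory store are stand-ins, not real packages.
package main

import (
	"net/http"

	"github.com/gin-gonic/gin"
)

type VRF struct {
	ID  string
	VNI int
}

type LearnedRoute struct {
	Prefix  string `json:"prefix"`
	NextHop string `json:"next_hop"`
	Peer    string `json:"peer"`
}

// TrixClient is a placeholder for however we end up talking to Trix.
type TrixClient interface {
	LearnedRoutes(vni int) ([]LearnedRoute, error)
}

func routesHandler(store map[string]VRF, trix TrixClient) gin.HandlerFunc {
	return func(c *gin.Context) {
		vrf, ok := store[c.Param("id")]
		if !ok {
			c.JSON(http.StatusNotFound, gin.H{"error": "vrf not found"})
			return
		}
		routes, err := trix.LearnedRoutes(vrf.VNI)
		if err != nil {
			c.JSON(http.StatusBadGateway, gin.H{"error": "trix unavailable"})
			return
		}
		c.JSON(http.StatusOK, gin.H{"routes": routes})
	}
}

func main() {
	r := gin.Default()
	var trix TrixClient // wired to a real implementation in practice
	r.GET("/vrfs/:id/routes", routesHandler(map[string]VRF{}, trix))
	r.Run(":8080")
}
#+END_SRC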

workday-notes.org Normal file

@@ -0,0 +1,138 @@
* Goal: Expand our Market - Lay the foundation for product-led growth
In the Nautilus team the biggest responsibility we have is the monolith, and as we've added people to the team, we've started adding new logic in services outside of the monolith. To make this simple and reduce the maintenance burden, I created exoskeleton and algolyzer, Go libraries we can use to develop Go services more quickly.
Exoskeleton provides a type-safe routing layer built on top of Gin and bakes in OTEL, so it's easy for us to take our services from local development to production-ready.
Algolyzer moves Algolia index updates out of the request span to keep latency low, while still making sure our UIs can easily surface the relevant objects in search.
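Exoskeleton itself is internal, so the snippet below is only a generic sketch using the public Gin and OpenTelemetry contrib packages to show the kind of tracing wiring it bakes in for each service.
#+BEGIN_SRC go
// Generic sketch only: this uses public packages (Gin + the OpenTelemetry
// contrib middleware) to illustrate the kind of wiring exoskeleton hides.
package main

import (
	"github.com/gin-gonic/gin"
	"go.opentelemetry.io/contrib/instrumentation/github.com/gin-gonic/gin/otelgin"
)

func main() {
	r := gin.Default()
	// Every request gets a span without each service re-implementing this.
	r.Use(otelgin.Middleware("example-service"))
	r.GET("/healthz", func(c *gin.Context) {
		c.JSON(200, gin.H{"status": "ok"})
	})
	r.Run(":8080")
}
#+END_SRC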
Additionally, I have made a number of improvements to our core infrastructure:
- Improving monitoring of our application to make major upgrades less scary
- Upgrading from Rails 5 to Rails 6
- Upgrading from Ruby 2 to Ruby 3
- Deploying and performing regular maintenance on our CockroachDB cluster
- Diagnosing anycast routing issues with our CRDB deployment that led to unexpectedly high latency, which resulted in changing the network from equal-path routing to prefer-local
With these changes we can keep the lights on while still experimenting cheaply with the common infrastructure that smaller services need.
* Goal: Build the foundation - A market-leading end-to-end user experience
As we started to deliver LBaaS, Infratographer had an entirely
different opinion on how to manage users and resource ownership, so I
created a GraphQL service to bridge the gap between Infratographer
concepts and Metal concepts, so that when a customer uses the product
it will feel familiar. The Metal API also emits events that can be
subscribed to over NATS to get updates for things such as organization
and project membership changes.
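A minimal sketch of consuming those events is below; the NATS client calls
are real, but the subject name and payload handling are hypothetical.
#+BEGIN_SRC go
// Sketch of consuming Metal API events over NATS; the subject name is
// hypothetical, the NATS client calls are real.
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	// Hypothetical subject for project membership changes.
	_, err = nc.Subscribe("metal.project.membership", func(m *nats.Msg) {
		log.Printf("membership event: %s", string(m.Data))
	})
	if err != nil {
		log.Fatal(err)
	}
	select {} // block forever in this sketch
}
#+END_SRC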
Accomplishing this meant close collaboration with the identity team to
establish the interfaces and decide who is responsible for which
parts. Load balancers can now be provisioned and act as if they belong
to a project, even though the system of record lies entirely outside
of the Metal API.
VMC-E exposed ordering issues in the VLAN assignment portion of our
networking stack. I worked with my teammates and SWNet to improve the
situation. I designed and implemented a queuing solution that places
order-dependent asynchronous tasks on queues with a single consumer,
so they execute in the order they were enqueued. We've already gotten
feedback from VMC-E and other customers that the correctness issues
with VLAN assignment have been solved, and we don't need to wait for a
complete networking overhaul from Orca to fix them. This solution also
gives us more opportunities to target other parts of our networking
stack that suffer from ordering issues.
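The real scheduler is internal and more involved; this is only a minimal
sketch of the core idea, one queue per key with a single consumer, so tasks
sharing a key run in enqueue order.
#+BEGIN_SRC go
// Minimal sketch: order-dependent tasks (e.g. per-VLAN assignments) go onto
// one queue per key, and each queue is drained by exactly one goroutine.
package main

import (
	"fmt"
	"sync"
)

type Task func()

type SerialScheduler struct {
	mu     sync.Mutex
	queues map[string]chan Task
	wg     sync.WaitGroup
}

func NewSerialScheduler() *SerialScheduler {
	return &SerialScheduler{queues: make(map[string]chan Task)}
}

// Enqueue places t on the queue for key. Tasks sharing a key execute in
// enqueue order because each queue has a single consumer goroutine.
func (s *SerialScheduler) Enqueue(key string, t Task) {
	s.mu.Lock()
	q, ok := s.queues[key]
	if !ok {
		q = make(chan Task, 64)
		s.queues[key] = q
		s.wg.Add(1)
		go func() {
			defer s.wg.Done()
			for task := range q {
				task()
			}
		}()
	}
	s.mu.Unlock()
	q <- t
}

// Close stops accepting new work and waits for every queue to drain.
func (s *SerialScheduler) Close() {
	s.mu.Lock()
	for _, q := range s.queues {
		close(q)
	}
	s.mu.Unlock()
	s.wg.Wait()
}

func main() {
	s := NewSerialScheduler()
	for i := 1; i <= 3; i++ {
		step := i
		s.Enqueue("vlan-100", func() { fmt.Println("assignment step", step) })
	}
	s.Close() // tasks for "vlan-100" print in order 1, 2, 3
}
#+END_SRC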
For federated SSO, I was able to keep communication between Platform
Identity, Nautilus and Portals flowing smoothly by documenting exactly
what was needed to get us in a position to onboard our first set of
customers using SSO. I used my knowledge of OAuth2 and OpenID Connect
to break down the integration points in a document shared between
these teams, so it was clear what we needed to do. This made it easier
to commit and deliver within the timeframe we set.
- not networking specific
- nano metal
- audit logging
* Goal: DS FunctionalPriorities - Build, socialize, and execute on plan to improve engineering experience
Throughout this year, I've been circulating ideas in writing and in
shared forums more often. Within the Nautilus team I gave 8 tech talks
to share ideas and information with the team and to solicit
feedback. I also wrote documents for collaborating with other teams,
mainly for LBaaS (specifically around how it integrates with the
EMAPI) and federated SSO.
- CRDB performance troubleshooting
I discussed how I determined that anycast routing was not properly
weighted, and my methodology for designing tests to diagnose the issue.
- Monitoring strategy for the API Rails/Ruby Upgrades
Here I discussed how we intended to do these upgrades in a way that
built confidence on top of what our test suites already gave us, by
measuring indicators of performance.
- Recorded deployment and monitoring of API
As we added more people to the team, recording this just made it
easier to have something we could point to for an API deployment. We
also have this process documented in the repo.
- Deep diving caching issues from #_incent-1564
We ran into a very hard-to-reproduce error where different users
accessing the same organization were returned the same list of
organizations/projects regardless of their access. Although the API
prevented actual reads of the objects a user didn't have proper
access to, serving the wrong set of IDs produced unexpected behavior
in the Portal. It took a long time to diagnose this, and I then
discussed the results with the team.
- API monitoring by thinking about what we actually deliver
Related to the rails upgrades, being able to accurately measure the
health of the monolith requires periodically re-evaluating if we're
measuring what matters.
- API Auth discussion with using identity-api
A discussion of the potential uses for identity-api in the
service-to-service context that the API relies on quite frequently as
we build functionality outside of the API.
- Static analysis on Ruby
With a dynamically typed language, runtime exceptions are no fun,
but some static analysis goes a long way. In this talk I explained
how it works at the AST level and how we can use this to enforce
conventions that we have adopted in the API. As an action item, I
started enabling useful "cops" to prevent common logic errors in
Ruby.
- Session Scheduler
Here I discussed the problem and the solution we implemented to
prevent VLANs from ending up in inconsistent states when assigned and
unassigned quickly. The solution we delivered was generic and solved
the problem simply; this talk shone some light on the new tool the
team now has for ordering problems.
* Twilio account
- always assisting the team
- helps new joiners ramp up fast
- participates in interviews
- easy to work with across teams
- clear communication
- able to navigate relations with delivery
- not only engineering: product, devrel