From 12cf3967ee9b6c211f163a1a5342dddc89633413 Mon Sep 17 00:00:00 2001
From: Adam Mohammed
Date: Mon, 30 Oct 2023 15:30:26 -0400
Subject: [PATCH] moar

---
 design/vrf-bgp-routes.org        |  17 ++++
 nanometal/k8s-concept-review.org |  47 +++++++++++
 nautilus/sprint-planning.org     |  23 ++++++
 vrf-exploration.org              |  43 ++++++++++
 workday-notes.org                | 138 +++++++++++++++++++++++++++++++
 5 files changed, 268 insertions(+)
 create mode 100644 design/vrf-bgp-routes.org
 create mode 100644 nanometal/k8s-concept-review.org
 create mode 100644 nautilus/sprint-planning.org
 create mode 100644 vrf-exploration.org
 create mode 100644 workday-notes.org

diff --git a/design/vrf-bgp-routes.org b/design/vrf-bgp-routes.org
new file mode 100644
index 0000000..efc8b7d
--- /dev/null
+++ b/design/vrf-bgp-routes.org
@@ -0,0 +1,17 @@
+#+TITLE: Implementing endpoint to expose learned routes
+#+AUTHOR: Adam Mohammed
+#+DATE: October 30, 2023
+
+I asked Tim about the difference between
+
+https://deploy.equinix.com/developers/api/metal#tag/VRFs/operation/getBgpDynamicNeighbors
+and what's available in Trix.
+
+The first is just data configured manually by the customer.
+
+Trix exposes the routes learned from peers.
+
+I then asked if it made sense to expose the data as part of:
+https://deploy.equinix.com/developers/api/metal#tag/VRFs/operation/getVrfRoutes
+
+and the answer I got was probably.
diff --git a/nanometal/k8s-concept-review.org b/nanometal/k8s-concept-review.org
new file mode 100644
index 0000000..23994a7
--- /dev/null
+++ b/nanometal/k8s-concept-review.org
@@ -0,0 +1,47 @@
+#+TITLE: K8s concepts review
+#+AUTHOR: Adam Mohammed
+#+DATE: September 18, 2023
+
+
+At one of the meetings I brought up how similar nano-metal felt to a
+collection of K8s specifications that make standing up and managing
+K8s clusters easier. In this document I'll cover the following topics
+at a high level: Cluster API, CNI, CCM, and CSI.
+
+First is the Cluster API. This came about as a means for creating and
+managing Kubernetes clusters using Kubernetes itself. The Cluster API
+allows an operator to use a so-called "management cluster" to create
+other K8s clusters known as "workload clusters." The Cluster API is
+NOT part of the core K8s resources; it is implemented as a set of
+custom resource definitions and controllers that carry out the
+desired actions.
+
+A cluster operator can use the Cluster API to create workload
+clusters by relying on three components: the bootstrap provider, the
+infrastructure provider, and the control plane provider. Nanometal
+aims to make provisioning of bare metal machines extensible and
+scalable by enabling facilities to carry out the operations requested
+by the EMAPI. We can think of the EMAPI as the "management cluster"
+in this world.
+
+What Metal has today maps well to the infrastructure provider, since
+all the Cluster API has to do is ask for machines with a certain
+configuration and the provider is responsible for making that
+happen. I think the bulk of the work for this project is figuring out
+how to build an infrastructure provider out of our existing
+components, but let's put that aside for right now and consider the
+rest of the components.
+
+The bootstrap and control plane providers are concepts that also
+seem important to our goal. We want it to be simple for us to enter a
+new facility and set up the components we need to start provisioning
+hardware.
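+
+To make the provider idea concrete, here is a minimal sketch of what
+an infrastructure-provider contract boils down to. The types and
+values below are invented for illustration; the real Cluster API
+contract is expressed as CRDs and controller reconcile loops, not a
+Go interface like this.
+
+#+BEGIN_SRC go
+package main
+
+import (
+	"context"
+	"fmt"
+)
+
+// MachineSpec is an invented stand-in for the shape of machine a
+// management cluster asks an infrastructure provider to realize.
+type MachineSpec struct {
+	Plan     string // hardware plan, e.g. "m3.small.x86"
+	Facility string // where the machine should land
+	UserData string // bootstrap payload, e.g. cloud-init
+}
+
+// InfrastructureProvider captures the essence of the contract: given
+// a desired machine, make it exist and report its identity.
+type InfrastructureProvider interface {
+	CreateMachine(ctx context.Context, spec MachineSpec) (id string, err error)
+	DeleteMachine(ctx context.Context, id string) error
+}
+
+// metalProvider would delegate to what Metal already has today, with
+// the EMAPI playing the role of the management cluster.
+type metalProvider struct{}
+
+func (p *metalProvider) CreateMachine(ctx context.Context, spec MachineSpec) (string, error) {
+	// A real provider would call into the provisioning pipeline here.
+	return "machine-123", nil
+}
+
+func (p *metalProvider) DeleteMachine(ctx context.Context, id string) error {
+	return nil
+}
+
+func main() {
+	var provider InfrastructureProvider = &metalProvider{}
+	id, _ := provider.CreateMachine(context.Background(),
+		MachineSpec{Plan: "m3.small.x86", Facility: "dc13"}) // placeholder values
+	fmt.Println("provisioned:", id)
+}
+#+END_SRC
+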
+The bootstrap provider, in Cluster API terms, turns a server
+provisioned with a base OS into an operating K8s node. For us, we
+would probably also want some process that turns any facility or
+existing datacenter into an Equinix Metal managed facility.
+
+Once we know about the facility that we need to manage, the concept
+of the control plane provider maps well to the diagrams from
+Nanometal so far. We'd want some component that installs the required
+agent and supporting components in the facility so we can start
+providing Metal services there.
diff --git a/nautilus/sprint-planning.org b/nautilus/sprint-planning.org
new file mode 100644
index 0000000..9a8aa56
--- /dev/null
+++ b/nautilus/sprint-planning.org
@@ -0,0 +1,23 @@
+* People
+- Emi
+- Hogle
+- Lucas
+- Laurence
+- Ian
+- Thiago
+- Sahil
+- Abhishek
+- Navaneeth
+- Surbhit
+- Nikita
+- Shelby
+
+
+** Remaining work:
+*** Integration testing with server service - Laurence
+*** Review nanometal doc with provisioning team - Ian
+*** Update automation of oapi-codegen and sqlboiler - Abhishek
+*** Serialized Scheduler garbage collection - Hogle
+
+
+** Sprint 19
diff --git a/vrf-exploration.org b/vrf-exploration.org
new file mode 100644
index 0000000..7deab4b
--- /dev/null
+++ b/vrf-exploration.org
@@ -0,0 +1,43 @@
+#+TITLE: VRF exploration
+
+
+Dangerous crew:
+- Jarrod
+- Ian
+- Tim
+
+Trix is an internal tool; we're going to use Trix as a datasource.
+
+Its primary domain is to collect and expose raw telemetry data
+internally.
+
+Trix itself is a collection tool (not aware of customers).
+
+Relationship data is in the monolith but also in the Tenant API.
+
+VRF GA Zoom: should take a look at that.
+
+Looking glass questions:
+- what does my routing table look like
+- where did it learn those routes from
+- is BGP up on this connection
+- what does the status look like
+- what does the neighbor status look like
+- what is the history of routes learned
+
+Normally not well structured: "shut them up"
+
+Show learned routes (like GCP):
+- vrf/id/routes
+- use the VNI from the VRF
+
+CCIE
diff --git a/workday-notes.org b/workday-notes.org
new file mode 100644
index 0000000..8fcff22
--- /dev/null
+++ b/workday-notes.org
@@ -0,0 +1,138 @@
+* Goal: Expand our Market - Lay the foundation for product-led growth
+In Nautilus the biggest responsibility we have is the monolith, and
+as we've added people to the team, we've started adding new logic in
+services outside of the monolith. To make this simple and reduce the
+maintenance burden, I created exoskeleton and algolyzer, Go libraries
+that let us develop Go services a bit more quickly.
+
+Exoskeleton provides a type-safe routing layer built on top of Gin,
+and bakes in OTEL so it's easy for us to take our services from local
+development to production-ready (a rough sketch of the pattern
+follows the list below).
+
+Algolyzer moves Algolia index updates out of the request span to keep
+latency low, while still making sure relevant objects are easily
+searchable from our UIs.
+
+Additionally, I have made a number of improvements to our core
+infrastructure:
+
+- Improving monitoring of our application to make major upgrades less scary
+- Upgrading from Rails 5 to Rails 6
+- Upgrading from Ruby 2 to Ruby 3
+- Deploying and performing regular maintenance on our CockroachDB cluster
+- Diagnosing anycast routing issues with our CRDB deployment that led
+  to unexpectedly high latency, which resulted in changing the
+  network from equal-path routing to prefer-local
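+
+Exoskeleton itself is internal, so what follows is only a minimal
+sketch of the underlying pattern it wraps: a Gin router with OTEL
+instrumentation baked in via the otelgin middleware. The service
+name and route are placeholders, not exoskeleton's actual API.
+
+#+BEGIN_SRC go
+package main
+
+import (
+	"net/http"
+
+	"github.com/gin-gonic/gin"
+	"go.opentelemetry.io/contrib/instrumentation/github.com/gin-gonic/gin/otelgin"
+)
+
+func main() {
+	r := gin.New()
+
+	// Tracing is wired in once so every handler gets spans for free;
+	// baking this step in is the kind of boilerplate exoskeleton
+	// removes from each new service.
+	r.Use(otelgin.Middleware("example-service"))
+
+	r.GET("/healthz", func(c *gin.Context) {
+		c.JSON(http.StatusOK, gin.H{"status": "ok"})
+	})
+
+	// Assumes an OTEL exporter/provider is configured elsewhere.
+	_ = r.Run(":8080")
+}
+#+END_SRC
+
+The type-safe routing layer would sit on top of something like this,
+deriving handlers from typed request/response structs; the sketch
+only shows the OTEL part.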
+
+With these changes we can keep the lights on while still allowing us
+to experiment cheaply with the common infra needed for smaller
+services.
+
+
+* Goal: Build the foundation - A market-leading end-to-end user experience
+
+
+As we started to deliver LBaaS, Infratographer had an entirely
+different opinion on how to manage users and resource ownership, so I
+created a GraphQL service to bridge the gap between Infratographer
+concepts and Metal concepts, so that when a customer uses the
+product, it'll seem familiar. The Metal API also emits events that
+can be subscribed to over NATS to get updates for things such as
+organization and project membership changes.
+
+Accomplishing this meant close collaboration with the identity team
+to establish the interfaces and decide who is responsible for which
+parts. Load balancers can now be provisioned and act as if they
+belong to a project, even though the system of record lies completely
+outside of the Metal API.
+
+VMC-E exposed ordering issues in the VLAN assignment portion of our
+networking stack. I worked with my teammates and SWNet to improve the
+situation. I designed and implemented a queuing solution that places
+order-dependent asynchronous tasks on queues with a single
+consumer. We've already gotten feedback from VMC-E and other
+customers that the correctness issues with VLAN assignment have been
+solved, and we don't need to wait for a complete networking overhaul
+from Orca to fix it. This solution gives us more opportunities to
+target other parts of our networking stack that suffer from ordering
+issues.
+
+For federated SSO, I helped keep communication between Platform
+Identity, Nautilus, and Portals flowing smoothly by documenting
+exactly what was needed to get us in a position to onboard our first
+set of customers using SSO. I used my knowledge of OAuth2 and OpenID
+Connect to break down the integration points in a document shared
+between these teams so it was clear what we needed to do. This made
+it easier to commit and deliver within the timeframe we set.
+
+Not networking specific:
+- nano metal
+- audit logging
+
+
+* Goal: DS Functional Priorities - Build, socialize, and execute on plan to improve engineering experience
+
+Throughout this year, I've been circulating ideas in writing and in
+shared forums more often. Within the Nautilus team I gave 8 tech
+talks to share ideas and information with the team and to solicit
+feedback. I also wrote documents for collaborating with other teams,
+mainly for LBaaS (specifically around how it integrates with the
+EMAPI) and federated SSO.
+
+- CRDB performance troubleshooting
+
+  I discussed how I determined that anycast routing was not properly
+  weighted, and my methodology for designing tests to diagnose the
+  issue.
+
+- Monitoring strategy for the API Rails/Ruby upgrades
+
+  Here I discussed how we intended to do these upgrades in a way that
+  built confidence on top of what we get from our test suites by
+  measuring performance indicators.
+
+- Recorded deployment and monitoring of the API
+
+  As we added more people to the team, recording this just made it
+  easier to have something we could point to for an API deployment.
+  We also have this process documented in the repo.
+
+- Deep diving caching issues from #_incent-1564
+
+  We ran into a very hard-to-reproduce error where different users
+  accessing the same organization were returned the same list of
+  organizations/projects regardless of access. Although the API
+  prevented actual reads of the objects that a user didn't have
+  proper access to, serving the wrong set of IDs produced unexpected
+  behavior in the Portal. It took a long time to diagnose this, and
+  then I discussed the results with the team.
+
+- API monitoring by thinking about what we actually deliver
+
+  Related to the Rails upgrades, being able to accurately measure the
+  health of the monolith requires periodically re-evaluating whether
+  we're measuring what matters.
+
+- API auth discussion on using identity-api
+
+  Discussion of the potential uses for identity-api in the
+  service-to-service context that the API relies on quite frequently
+  as we build functionality outside of the API.
+
+- Static analysis on Ruby
+
+  With a dynamically typed language, runtime exceptions are no fun,
+  but some static analysis goes a long way. In this talk I explained
+  how it works at the AST level and how we can use it to enforce
+  conventions that we have adopted in the API. As an action item, I
+  started enabling useful "cops" to prevent common logic errors in
+  Ruby.
+
+- Session Scheduler
+
+  Here I discussed the problem and the solution that we implemented
+  to prevent VLANs from ending up in inconsistent states when
+  assigned and unassigned quickly. The solution we delivered was
+  generic and solved the problem simply; the talk was to shine some
+  light on the new tool the team has for ordering problems (a rough
+  sketch of the pattern is at the end of these notes).
+
+
+
+* Twilio account
+
+
+- always assisting the team
+- helps new joiners ramp up fast
+- participates in interviews
+- easy to work with across teams
+- clear communication
+- able to navigate relations with delivery
+- not only engineering - product, devrel
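+
+Since the session scheduler comes up a couple of times above, here is
+a minimal sketch of the single-consumer queue pattern it relies on,
+assuming order-dependent tasks keyed by VLAN. Names and sizes are
+illustrative, not the actual implementation.
+
+#+BEGIN_SRC go
+package main
+
+import (
+	"fmt"
+	"sync"
+)
+
+// Task is any order-dependent unit of work, e.g. a VLAN assignment
+// or unassignment.
+type Task func()
+
+// SerialQueues routes tasks onto a per-key queue, each drained by a
+// single consumer goroutine, so tasks sharing a key can never
+// interleave and always run in FIFO order.
+type SerialQueues struct {
+	mu     sync.Mutex
+	queues map[string]chan Task
+}
+
+func NewSerialQueues() *SerialQueues {
+	return &SerialQueues{queues: make(map[string]chan Task)}
+}
+
+// Enqueue adds a task to the queue for key, creating the queue and
+// its single consumer on first use.
+func (s *SerialQueues) Enqueue(key string, t Task) {
+	s.mu.Lock()
+	q, ok := s.queues[key]
+	if !ok {
+		q = make(chan Task, 64) // illustrative buffer size
+		s.queues[key] = q
+		go func() { // the single consumer for this key
+			for task := range q {
+				task()
+			}
+		}()
+	}
+	s.mu.Unlock()
+	q <- t
+}
+
+func main() {
+	qs := NewSerialQueues()
+	var wg sync.WaitGroup
+	wg.Add(2)
+	// Assign then unassign the same VLAN: the single consumer
+	// guarantees these run in order, never interleaved.
+	qs.Enqueue("vlan-1001", func() { defer wg.Done(); fmt.Println("assign vlan-1001") })
+	qs.Enqueue("vlan-1001", func() { defer wg.Done(); fmt.Println("unassign vlan-1001") })
+	wg.Wait()
+}
+#+END_SRC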