Adam Mohammed
2023-10-30 15:30:26 -04:00
parent 6f8d6220fa
commit 12cf3967ee
5 changed files with 268 additions and 0 deletions

design/vrf-bgp-routes.org Normal file

@@ -0,0 +1,17 @@
#+TITLE: Implementing endpoint to expose learned routes
#+AUTHOR: Adam Mohammed
#+DATE: October 30, 2023
I asked Tim what the difference is between
https://deploy.equinix.com/developers/api/metal#tag/VRFs/operation/getBgpDynamicNeighbors
and what's available in Trix.
The first endpoint only returns data the customer configured manually, while
Trix exposes the routes actually learned from peers.
I then asked if it made sense to expose the data as part of:
https://deploy.equinix.com/developers/api/metal#tag/VRFs/operation/getVrfRoutes
and the answer I got was probably.
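If we do fold learned routes into getVrfRoutes, a rough Go sketch of how
configured and learned routes could share one payload is below. The field
names (=origin=, =learned_from=) are hypothetical, not the actual Metal API
schema.
#+BEGIN_SRC go
// Purely illustrative: these field names are hypothetical, not the real
// Metal API schema. The idea is that configured and learned routes could
// share one payload, distinguished by an origin marker.
package main

import (
	"encoding/json"
	"fmt"
)

type VrfRoute struct {
	ID          string `json:"id"`
	Prefix      string `json:"prefix"`
	NextHop     string `json:"next_hop"`
	Origin      string `json:"origin"`                 // e.g. "static" (configured) or "bgp" (learned via Trix)
	LearnedFrom string `json:"learned_from,omitempty"` // peer address, set only for learned routes
}

func main() {
	routes := []VrfRoute{
		{ID: "r1", Prefix: "10.0.0.0/24", NextHop: "192.0.2.1", Origin: "static"},
		{ID: "r2", Prefix: "10.1.0.0/24", NextHop: "192.0.2.7", Origin: "bgp", LearnedFrom: "192.0.2.7"},
	}
	out, _ := json.MarshalIndent(routes, "", "  ")
	fmt.Println(string(out))
}
#+END_SRC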


@@ -0,0 +1,47 @@
#+TITLE: K8s concepts review
#+AUTHOR: Adam Mohammed
#+DATE: September 18, 2023
At one of the meetings I brought up how similar nano-metal felt to a
collection of K8s specifications that make standing up and managing
K8s clusters easier. In this document I'll cover the following topics
at a high level: Cluster API, CNI, CCM, and CSI.
First is the Cluster API, which came about as a means for creating and
managing Kubernetes clusters using Kubernetes itself. The Cluster API
allows an operator to use a so-called "management cluster" to create
other K8s clusters known as "workload clusters." The Cluster API is
NOT part of the core K8s resources, but is implemented as a set of
custom resource definitions and controllers that actually carry out
the desired actions.
A cluster operator can use the cluster-api to create workload clusters
by relying on three components: the bootstrap provider, the
infrastructure provider, and the control plane provider. Nanometal
aims to make provisioning of bare metal machines extensible and
scalable by enabling facilities to carry out the operations requested
by the EMAPI. We can think of the EMAPI as the "management cluster" in
this world.
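To make the three provider roles concrete, here is a minimal Go sketch. The
interface and type names are mine for illustration only; they are not actual
cluster-api types.
#+BEGIN_SRC go
// Hypothetical sketch: these interfaces are my own shorthand for the three
// cluster-api provider roles; they are not real cluster-api types.
package provider

import "context"

// InfrastructureProvider turns a machine request into actual compute,
// network, and storage (roughly what Metal does today).
type InfrastructureProvider interface {
	CreateMachine(ctx context.Context, spec MachineSpec) (MachineID, error)
	DeleteMachine(ctx context.Context, id MachineID) error
}

// BootstrapProvider turns a server provisioned with a base OS into a
// node that can join a cluster.
type BootstrapProvider interface {
	BootstrapData(ctx context.Context, id MachineID) ([]byte, error)
}

// ControlPlaneProvider manages the lifecycle of the cluster's control plane.
type ControlPlaneProvider interface {
	EnsureControlPlane(ctx context.Context, cluster string, replicas int) error
}

// MachineSpec is a stand-in for whatever configuration a request carries.
type MachineSpec struct {
	Plan  string // hardware plan / size
	Metro string // facility or metro where the machine should land
}

type MachineID string
#+END_SRC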
What Metal has today maps well to the infrastructure provider, since
all the cluster-api has to do is ask for machines with a certain
configuration and the provider is responsible for making that
happen. I think the bulk of the work for this project is figuring out
how we build an infrastructure provider out of our existing
components, but let's set that aside for now and consider the rest of
the components.
The bootstrap and control plane providers are concepts that also seem
important to our goal. We want it to be simple for us to enter a new
facility and set up the components we need to start provisioning
hardware. The bootstrap provider, in cluster-api terms, turns a server
provisioned with a base OS into an operating K8s node. For us, we
would probably also want some process that turns any facility or
existing datacenter into an Equinix Metal managed facility.
Once we know about the facility we need to manage, the concept of the
control plane provider maps well to the diagrams from Nanometal so
far. We'd want some component that installs the required agent and
supporting components in the facility so we can start providing Metal
services there.


@@ -0,0 +1,23 @@
* People
- Emi
- Hogle
- Lucas
- Laurence
- Ian
- Thiago
- Sahil
- Abhishek
- Navaneeth
- Surbhit
- Nikita
- Shelby
** Remaining work:
*** Integration testing with server service - Laurence
*** Review nanometal doc with provisioning team - Ian
*** Update automation of oapi-codegen and sqlboiler - Abhishek
*** Serialized Scheduler garbage collection - Hogle
** Sprint 19

vrf-exploration.org Normal file

@@ -0,0 +1,43 @@
#+TITLE: VRF exploration
Dangerous crew: Jarrod, Ian, Tim
- Trix is an internal tool; we're going to use Trix as a data source
- Its primary domain is to collect and expose raw telemetry data internally
- Trix itself is a collection tool (it is not aware of customers)
- Relationship data is in the monolith, but also in the Tenant API
- VRF GA zoom should take a look at that
- Looking glass questions we want to answer:
  - What does my routing table look like?
  - Where did it learn those routes from?
  - Is BGP up on this connection?
  - What does the status look like? What does the neighbor status look like?
  - What is the history of routes learned?
- This data is normally not well structured
- "shut them up"
- Show learned routes, as GCP does
- Proposed endpoint: vrf/:id/routes, using the VNI from the VRF (see the
  sketch below)
- CCIE
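Below is a rough Go sketch of what that endpoint could look like. It is only
an illustration: the =TrixClient= interface and the in-memory VRF store are
stand-ins for however we actually reach Trix and the monolith.
#+BEGIN_SRC go
// Hypothetical sketch of the proposed endpoint: look up the VRF, take its
// VNI, and ask Trix for the routes learned on it. TrixClient and the
// in-memory store are stand-ins, not real packages.
package main

import (
	"net/http"

	"github.com/gin-gonic/gin"
)

type VRF struct {
	ID  string
	VNI int
}

type LearnedRoute struct {
	Prefix  string `json:"prefix"`
	NextHop string `json:"next_hop"`
	Peer    string `json:"peer"`
}

// TrixClient is a placeholder for however we end up talking to Trix.
type TrixClient interface {
	LearnedRoutes(vni int) ([]LearnedRoute, error)
}

func routesHandler(store map[string]VRF, trix TrixClient) gin.HandlerFunc {
	return func(c *gin.Context) {
		vrf, ok := store[c.Param("id")]
		if !ok {
			c.JSON(http.StatusNotFound, gin.H{"error": "vrf not found"})
			return
		}
		routes, err := trix.LearnedRoutes(vrf.VNI)
		if err != nil {
			c.JSON(http.StatusBadGateway, gin.H{"error": "trix unavailable"})
			return
		}
		c.JSON(http.StatusOK, gin.H{"routes": routes})
	}
}

func main() {
	r := gin.Default()
	var trix TrixClient // wired to a real implementation in practice
	r.GET("/vrfs/:id/routes", routesHandler(map[string]VRF{}, trix))
	r.Run(":8080")
}
#+END_SRC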

workday-notes.org Normal file

@@ -0,0 +1,138 @@
* Goal: Expand our Market - Lay the foundation for product-led growth
In the Nautilus team the biggest responsibility we have is the monolith, and as we've added people to the team, we've started adding new logic in services outside of the monolith. To make this simple and reduce the maintenance burden, I created exoskeleton and algolyzer, Go libraries we can use to develop Go services more quickly.
Exoskeleton provides a type-safe routing layer built on top of Gin and bakes in OTEL, so it's easy for us to take our services from local development to production-ready.
Algolyzer moves Algolia index updates out of the request span to keep latency low, while still making sure our UIs can easily surface the relevant objects in search.
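Exoskeleton itself is internal, so the snippet below is only a generic sketch using the public Gin and OpenTelemetry contrib packages to show the kind of tracing wiring it bakes in for each service.
#+BEGIN_SRC go
// Generic sketch only: this uses public packages (Gin + the OpenTelemetry
// contrib middleware) to illustrate the kind of wiring exoskeleton hides.
package main

import (
	"github.com/gin-gonic/gin"
	"go.opentelemetry.io/contrib/instrumentation/github.com/gin-gonic/gin/otelgin"
)

func main() {
	r := gin.Default()
	// Every request gets a span without each service re-implementing this.
	r.Use(otelgin.Middleware("example-service"))
	r.GET("/healthz", func(c *gin.Context) {
		c.JSON(200, gin.H{"status": "ok"})
	})
	r.Run(":8080")
}
#+END_SRC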
Additionally, I have made a number of improvements to our core infrastructure:
- Improving monitoring of our application to make major upgrades less scary
- Upgrading from Rails 5 to Rails 6
- Upgrading from Ruby 2 to Ruby 3
- Deploying and performing regular maintenance on our CockroachDB cluster
- Diagnosing anycast routing issues with our CRDB deployment that led to unexpectedly high latency, which resulted in changing the network from equal-path routing to prefer-local
With these changes we can keep the lights on while still experimenting cheaply with the common infrastructure that smaller services need.
* Goal: Build the foundation - A market-leading end-to-end user experience
As we started to deliver LBaaS, Infratographer had an entirely
different opinion on how to manage users and resource ownership, so I
created a GraphQL service to bridge the gap between Infratographer
concepts and Metal concepts, so that when a customer uses the product
it will feel familiar. The Metal API also emits events that can be
subscribed to over NATS to get updates for things such as organization
and project membership changes.
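A minimal sketch of consuming those events is below; the NATS client calls
are real, but the subject name and payload handling are hypothetical.
#+BEGIN_SRC go
// Sketch of consuming Metal API events over NATS; the subject name is
// hypothetical, the NATS client calls are real.
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	// Hypothetical subject for project membership changes.
	_, err = nc.Subscribe("metal.project.membership", func(m *nats.Msg) {
		log.Printf("membership event: %s", string(m.Data))
	})
	if err != nil {
		log.Fatal(err)
	}
	select {} // block forever in this sketch
}
#+END_SRC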
Accomplishing this meant close collaboration with the identity team to
establish the interfaces and decide who is responsible for which
parts. Load balancers can now be provisioned and act as if they belong
to a project, even though the system of record lies entirely outside
of the Metal API.
VMC-E exposed ordering issues in the VLAN assignment portion of our
networking stack. I worked with my teammates and SWNet to improve the
situation. I designed and implemented a queuing solution that places
order-dependent asynchronous tasks on queues with a single consumer,
so they execute in the order they were enqueued. We've already gotten
feedback from VMC-E and other customers that the correctness issues
with VLAN assignment have been solved, and we don't need to wait for a
complete networking overhaul from Orca to fix them. This solution also
gives us more opportunities to target other parts of our networking
stack that suffer from ordering issues.
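The real scheduler is internal and more involved; this is only a minimal
sketch of the core idea, one queue per key with a single consumer, so tasks
sharing a key run in enqueue order.
#+BEGIN_SRC go
// Minimal sketch: order-dependent tasks (e.g. per-VLAN assignments) go onto
// one queue per key, and each queue is drained by exactly one goroutine.
package main

import (
	"fmt"
	"sync"
)

type Task func()

type SerialScheduler struct {
	mu     sync.Mutex
	queues map[string]chan Task
	wg     sync.WaitGroup
}

func NewSerialScheduler() *SerialScheduler {
	return &SerialScheduler{queues: make(map[string]chan Task)}
}

// Enqueue places t on the queue for key. Tasks sharing a key execute in
// enqueue order because each queue has a single consumer goroutine.
func (s *SerialScheduler) Enqueue(key string, t Task) {
	s.mu.Lock()
	q, ok := s.queues[key]
	if !ok {
		q = make(chan Task, 64)
		s.queues[key] = q
		s.wg.Add(1)
		go func() {
			defer s.wg.Done()
			for task := range q {
				task()
			}
		}()
	}
	s.mu.Unlock()
	q <- t
}

// Close stops accepting new work and waits for every queue to drain.
func (s *SerialScheduler) Close() {
	s.mu.Lock()
	for _, q := range s.queues {
		close(q)
	}
	s.mu.Unlock()
	s.wg.Wait()
}

func main() {
	s := NewSerialScheduler()
	for i := 1; i <= 3; i++ {
		step := i
		s.Enqueue("vlan-100", func() { fmt.Println("assignment step", step) })
	}
	s.Close() // tasks for "vlan-100" print in order 1, 2, 3
}
#+END_SRC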
For federated SSO, I was able to keep communication between Platform
Identity, Nautilus and Portals flowing smoothly by documenting exactly
what was needed to get us in a position to onboard our first set of
customers using SSO. I used my knowledge of OAuth2 and OpenID Connect
to break down the integration points in a document shared between
these teams, so it was clear what we needed to do. This made it easier
to commit and deliver within the timeframe we set.
- not networking specific
- nano metal
- audit logging
* Goal: DS FunctionalPriorities - Build, socialize, and execute on plan to improve engineering experience
Throughout this year, I've been circulating ideas in writing and in
shared forums more often. Within the Nautilus team I gave 8 tech talks
to share ideas and information with the team and to solicit
feedback. I also wrote documents for collaborating with other teams,
mainly for LBaaS (specifically around how it integrates with the
EMAPI) and federated SSO.
- CRDB performance troubleshooting
I discussed how I determined that anycast routing was not properly
weighted, and my methodology for designing tests to diagnose the issue.
- Monitoring strategy for the API Rails/Ruby Upgrades
Here I discussed how we intended to do these upgrades in a way that
built confidence on top of what our test suites already gave us, by
measuring indicators of performance.
- Recorded deployment and monitoring of API
As we added more people to the team, recording this just made it
easier to have something we could point to for an API deployment. We
also have this process documented in the repo.
- Deep diving caching issues from #_incent-1564
We ran into a very hard-to-reproduce error where different users
accessing the same organization were returned the same list of
organizations/projects regardless of their access. Although the API
prevented actual reads of the objects a user didn't have proper
access to, serving the wrong set of IDs produced unexpected behavior
in the Portal. It took a long time to diagnose this, and I then
discussed the results with the team.
- API monitoring by thinking about what we actually deliver
Related to the rails upgrades, being able to accurately measure the
health of the monolith requires periodically re-evaluating if we're
measuring what matters.
- API Auth discussion with using identity-api
A discussion of the potential uses for identity-api in the
service-to-service context that the API relies on quite frequently as
we build functionality outside of the API.
- Static analysis on Ruby
With a dynamically typed language, runtime exceptions are no fun,
but some static analysis goes a long way. In this talk I explained
how it works at the AST level and how we can use this to enforce
conventions that we have adopted in the API. As an action item, I
started enabling useful "cops" to prevent common logic errors in
Ruby.
- Session Scheduler
Here I discussed the problem and the solution we implemented to
prevent VLANs from ending up in inconsistent states when assigned and
unassigned quickly. The solution we delivered was generic and solved
the problem simply; this talk shone some light on the new tool the
team now has for ordering problems.
* Twilio account
- always assisting the team
- helps new joiners ramp up fast
- participates in interviews
- easy to work with across teams
- clear communication
- able to navigate relations with delivery
- not only engineering: product, devrel