moar
138
workday-notes.org
Normal file
@@ -0,0 +1,138 @@
* Goal: Expand our Market - Lay the foundation for product-led growth
In Nautilus, our biggest responsibility is the monolith, and as we've added people to the team, we've started to build new logic in services outside of the monolith. To keep this simple and reduce the maintenance burden, I created exoskeleton and algolyzer, Go libraries that let us develop Go services more quickly.

Exoskeleton provides a type-safe routing layer built on top of Gin and bakes in OTEL, so it's easy for us to take our services from local development to production-ready.
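
To illustrate what a type-safe routing layer can look like, here is a minimal sketch. It is not exoskeleton's actual API: it uses the standard library's net/http instead of Gin so it stays self-contained, and the names `Typed`, `GreetRequest`, `GreetResponse`, `greet` and `call` are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// Typed adapts a function over concrete request/response types into an
// http.HandlerFunc, so JSON decoding and encoding live in one place.
func Typed[Req, Resp any](fn func(Req) (Resp, error)) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		var req Req
		if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
			http.Error(w, "invalid request body", http.StatusBadRequest)
			return
		}
		resp, err := fn(req)
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		_ = json.NewEncoder(w).Encode(resp)
	}
}

// GreetRequest and GreetResponse are hypothetical payload types.
type GreetRequest struct {
	Name string `json:"name"`
}

type GreetResponse struct {
	Message string `json:"message"`
}

// greet only ever sees typed values, never raw bytes.
func greet(req GreetRequest) (GreetResponse, error) {
	return GreetResponse{Message: "hello, " + req.Name}, nil
}

// call sends a POST with the given body to h and returns status and body.
func call(h http.Handler, body string) (int, string) {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodPost, "/greet", strings.NewReader(body))
	h.ServeHTTP(rec, req)
	return rec.Code, rec.Body.String()
}

func main() {
	status, body := call(Typed(greet), `{"name":"nautilus"}`)
	fmt.Println(status, body)
}
```

The point of the design is that handlers declare their input and output types, and a mismatch becomes a compile error instead of a runtime surprise.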

Algolyzer makes it easier to update Algolia indexes outside of the request span, which keeps latency low while still making sure relevant objects are easy to search for in our UIs.
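
The idea of moving index updates out of the request span can be sketched as follows. This is an assumption-laden illustration, not algolyzer's API and not the Algolia client: `AsyncIndexer`, `IndexUpdate` and the `apply` callback are hypothetical names, and a real `apply` would call the search index.

```go
package main

import (
	"fmt"
	"sync"
)

// IndexUpdate is a hypothetical record destined for a search index.
type IndexUpdate struct {
	ObjectID string
	Body     string
}

// AsyncIndexer buffers updates and applies them on a background
// goroutine, so request handlers never wait on the index.
type AsyncIndexer struct {
	updates chan IndexUpdate
	wg      sync.WaitGroup
}

// NewAsyncIndexer starts the background worker. apply is whatever
// actually writes to the index.
func NewAsyncIndexer(buffer int, apply func(IndexUpdate)) *AsyncIndexer {
	ix := &AsyncIndexer{updates: make(chan IndexUpdate, buffer)}
	ix.wg.Add(1)
	go func() {
		defer ix.wg.Done()
		for u := range ix.updates {
			apply(u)
		}
	}()
	return ix
}

// Enqueue returns immediately. It reports false when the buffer is
// full, leaving the caller to decide whether to drop or retry.
func (ix *AsyncIndexer) Enqueue(u IndexUpdate) bool {
	select {
	case ix.updates <- u:
		return true
	default:
		return false
	}
}

// Close stops accepting updates and waits for queued ones to be applied.
// Enqueue must not be called after Close.
func (ix *AsyncIndexer) Close() {
	close(ix.updates)
	ix.wg.Wait()
}

func main() {
	var indexed []string
	ix := NewAsyncIndexer(16, func(u IndexUpdate) {
		indexed = append(indexed, u.ObjectID)
	})
	ix.Enqueue(IndexUpdate{ObjectID: "project-1", Body: "demo"})
	ix.Enqueue(IndexUpdate{ObjectID: "project-2", Body: "demo"})
	ix.Close()
	fmt.Println(indexed)
}
```

A request handler would call `Enqueue` and return; the trade-off is that the index lags slightly behind the database.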

Additionally, I have made a number of improvements to our core infrastructure:

- Improved monitoring of our application to make major upgrades less scary
- Upgraded from Rails 5 to Rails 6
- Upgraded from Ruby 2 to Ruby 3
- Deployed and performed regular maintenance on our CockroachDB cluster
- Diagnosed anycast routing issues with our CRDB deployment that led to unexpectedly high latency, which resulted in changing the network from equal-path routing to prefer-local.

With these changes, we can keep the lights on while experimenting cheaply with the common infrastructure that smaller services need.

* Goal: Build the foundation - A market-leading end-to-end user experience

As we started to deliver LBaaS, we found that Infratographer had an
entirely different opinion on how to manage users and resource
ownership. I created a GraphQL service to bridge the gap between
Infratographer concepts and Metal concepts, so the product feels
familiar to customers. The Metal API also emits events over NATS that
services can subscribe to for updates such as organization and project
membership changes.

Accomplishing this meant collaborating closely with the identity team
to establish the interfaces and decide who is responsible for which
parts. Load balancers can now be provisioned and act as if they belong
to a project, even though the system of record lies completely outside
of the Metal API.

VMC-E exposed ordering issues in the VLAN assignment portion of our
networking stack. I worked with my teammates and SWNet to improve the
situation. I designed and implemented a queuing solution that lets us
run order-dependent asynchronous tasks on queues with a single
consumer. We've already gotten feedback from VMC-E and other customers
that the correctness issues with VLAN assignment have been solved, so
we don't need to wait for a complete networking overhaul from Orca to
fix them. This solution also gives us more opportunities to target
other parts of our networking stack that suffer from ordering issues.
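
The single-consumer idea can be sketched in a few lines of Go. This is only an illustration of the technique, not the solution that was delivered: `OrderedQueues`, `Submit`, `Close` and the `vlan-100` key are hypothetical, and the sketch is in-process, whereas a real deployment would likely use a durable queue.

```go
package main

import (
	"fmt"
	"sync"
)

// Task is a unit of asynchronous work.
type Task func()

// OrderedQueues gives every key its own queue with exactly one
// consumer, so tasks for the same key run one at a time, in the order
// they were submitted. Different keys still run concurrently.
type OrderedQueues struct {
	mu     sync.Mutex
	queues map[string]chan Task
	wg     sync.WaitGroup
}

func NewOrderedQueues() *OrderedQueues {
	return &OrderedQueues{queues: make(map[string]chan Task)}
}

// Submit enqueues t on the queue for key, starting that queue's single
// consumer on first use. It must not be called after Close.
func (o *OrderedQueues) Submit(key string, t Task) {
	o.mu.Lock()
	q, ok := o.queues[key]
	if !ok {
		q = make(chan Task, 64)
		o.queues[key] = q
		o.wg.Add(1)
		go func() {
			defer o.wg.Done()
			for task := range q {
				task() // the only goroutine that runs tasks for this key
			}
		}()
	}
	o.mu.Unlock()
	q <- t
}

// Close stops all queues and waits for queued tasks to finish.
func (o *OrderedQueues) Close() {
	o.mu.Lock()
	for _, q := range o.queues {
		close(q)
	}
	o.mu.Unlock()
	o.wg.Wait()
}

func main() {
	oq := NewOrderedQueues()
	var steps []string
	for _, s := range []string{"assign", "unassign", "assign"} {
		s := s
		oq.Submit("vlan-100", func() { steps = append(steps, s) })
	}
	oq.Close()
	fmt.Println(steps)
}
```

Because only one consumer drains each queue, a quick assign followed by an unassign cannot be applied out of order, which is the class of bug described above.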

For federated SSO, I helped keep communication between Platform
Identity, Nautilus and Portals flowing smoothly by documenting exactly
what was needed to get us in a position to onboard our first set of
customers using SSO. I used my knowledge of OAuth2 and OpenID Connect
to break down the integration points in a document shared between
these teams, so it was clear what we needed to do. This made it easier
to commit and deliver within the timeframe we set.

not networking specific
nano metal
audit logging

* Goal: DS FunctionalPriorities - Build, socialize, and execute on plan to improve engineering experience

Throughout this year, I've been circulating ideas in writing and in
shared forums more often. Within the Nautilus team I gave 8 tech talks
to share ideas and information with the team and to solicit
feedback. I also wrote documents for collaborating with other teams,
mainly for LBaaS (specifically around how it integrates with the
EMAPI) and federated SSO.

- CRDB performance troubleshooting

  I discussed how I determined that anycast routing was not properly
  weighted, and my methodology for designing tests to diagnose the issue.

- Monitoring strategy for the API Rails/Ruby Upgrades

  Here I discussed how we intended to do these upgrades in a way that
  built confidence beyond what our test suites gave us, by measuring
  indicators of performance.

- Recorded deployment and monitoring of API

  As we added more people to the team, a recording gave us something
  we could point to for an API deployment. We also have this process
  documented in the repo.

- Deep diving caching issues from #_incent-1564

  We ran into a very hard-to-reproduce error where different users
  accessing the same organization were returned the same list of
  organizations/projects regardless of access. Although the API
  prevented actual reads of objects that a user didn't have proper
  access to, serving the wrong set of IDs produced unexpected behavior
  in the Portal. It took a long time to diagnose, and afterward I
  discussed the results with the team.

- API monitoring by thinking about what we actually deliver

  Related to the Rails upgrades: accurately measuring the health of
  the monolith requires periodically re-evaluating whether we're
  measuring what matters.

- API Auth discussion with using identity-api

  A discussion of the potential uses for identity-api in a
  service-to-service context, which the API uses quite frequently as
  we build functionality outside of the API.

- Static analysis on Ruby

  With a dynamically typed language, runtime exceptions are no fun,
  but some static analysis goes a long way. In this talk I explained
  how it works at the AST level and how we can use it to enforce
  conventions that we have adopted in the API. As an action item, I
  started enabling useful "cops" to prevent common logic errors in
  Ruby.

- Session Scheduler

  Here I discussed the problem and the solution that we implemented
  to prevent VLANs from being in inconsistent states when assigned and
  unassigned quickly. The solution we delivered was generic and solved
  the problem simply; this talk shone some light on the new tool the
  team now has available for ordering problems.

* Twilio account

always assisting the team
help new joiners to ramp up fast
participate in interviews
easy to work with across teams
clear communication
able to navigate
relations with delivery
not only engineering - product, devrel