From 63fe9cf740d7b7f27c784366afb32c0234bc5f0c Mon Sep 17 00:00:00 2001
From: Adam Mohammed
Date: Wed, 18 Sep 2024 17:39:12 -0400
Subject: [PATCH] Test plan

---
 .../permissions-migration/test-plan.org | 133 ++++++++++++++++--
 1 file changed, 119 insertions(+), 14 deletions(-)

diff --git a/equinix/design/permissions-migration/test-plan.org b/equinix/design/permissions-migration/test-plan.org
index 7fa22e3..9661149 100644
--- a/equinix/design/permissions-migration/test-plan.org
+++ b/equinix/design/permissions-migration/test-plan.org
@@ -3,21 +3,126 @@
 * What's changed
 
+In the Metal API, there is now the ability to run different
+authorization policy engines. We have two engines: the cancancan
+engine and the Permissions API engine. We added the ability to run
+both of these during the span of a request while explicitly naming one
+as the source of truth for the ultimate authorization outcome.
+
+* What are we trying to get out of this test?
+
+We want to start sending authorization checks through Permissions API,
+but not break existing behavior. We need a way to validate that our
+permissions checks through the runtime behave as we expect.
+
+The first barrier to making sure we're not breaking production is to
+run the combined policies for all CI test cases. This proves, for the
+tested code paths, that we're at least able to serve requests.
+
+This test plan deals with validating the policy definition and
+integration with Permissions API in production.
+
 * Stages of testing
 
-- Initial Canary
-  - Run terraform against internal canary URL
-- Slow roll to production
-  - Watch for errors
-- In-production warn mode
-  - Observe for discrepancies between cancancan/iam-runtime
-- Runtime winning mode
-- Completed
-* Monitoring
-- Trace attributes that are relevant
+We'll roll this out in a few stages to be careful, since we're
+changing a fundamental piece of the architecture.
 
-- Dashboards
-- Create dashboard around cancancan disagreements
-- Create dashboard where resource was not metal org/project/user
+First, we'll run a smoke test suite against a canary which is separate
+from production traffic.
+Then, if the metrics look acceptable there, we'll roll this out to
+production, while keeping an eye specifically on latency and the
+number of 403s.
+Then, we'll monitor for discrepancies between the models and address
+them. This will be the bulk of the testing time, as we'll need a long
+enough duration to collect an accurate sample of the operations
+customers perform.
+Finally, we can move over to only using the runtime for authorization
+decisions.
 
-* Handling broken cases
+The next sections describe the test setup, what we'll monitor at each
+stage, the success criteria, and the rollback procedure.
+
+** Initial Canary
+
+In this setup, we'll have a separate ingress and deployment for the
+Metal API. This will allow us to exclusively route traffic to the
+backend configured to use the IAM runtime, while leaving production
+traffic using the cancancan policy only.
+
+The purpose of doing this is to try to find any hidden bugs that
+would cause an outage.
+
+We'll test this by running the terraform CI tests against the canary image.
+
+The success criteria for this step are:
+- CI test passes
+- CI test duration does not increase significantly compared to usual
+  runtimes (canary CI runtime <= 150% normal runtime)
+
+Rolling back here just involves cleaning up canary resources, and has
+no impact on customer experience.
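+
+For reference, the dual-engine evaluation described in "What's
+changed" (which the canary is the first environment to exercise) might
+look roughly like the sketch below. This is purely illustrative: the
+class name, engine wrappers, trace-attribute keys, and the default
+=POLICY_ENGINE= value are assumptions, not the actual Metal API
+implementation.
+
+#+begin_src ruby
+# Illustrative sketch only: run both engines for a single check, record
+# both outcomes on the current trace span, and return the decision from
+# whichever engine POLICY_ENGINE names as the source of truth.
+# Class, engine, and attribute names here are hypothetical.
+class DualPolicyCheck
+  def initialize(cancan_engine:, runtime_engine:, span:,
+                 mode: ENV.fetch("POLICY_ENGINE", "cancancan_only"))
+    @cancan_engine  = cancan_engine   # wraps the existing cancancan abilities
+    @runtime_engine = runtime_engine  # wraps the IAM runtime / Permissions API check
+    @span           = span            # current trace span (e.g. OpenTelemetry)
+    @mode           = mode
+  end
+
+  def allowed?(subject, action, resource)
+    cancan  = @cancan_engine.allowed?(subject, action, resource)
+    runtime = @runtime_engine.allowed?(subject, action, resource)
+
+    # Record both outcomes so disagreements show up in tracing.
+    @span.set_attribute("authz.cancancan.allowed", cancan)
+    @span.set_attribute("authz.runtime.allowed", runtime)
+    @span.set_attribute("authz.results_match", cancan == runtime)
+
+    case @mode
+    when "runtime_only"   then runtime
+    when "prefer_runtime" then runtime.nil? ? cancan : runtime # simplified fallback
+    else                       cancan # default: cancancan stays the source of truth
+    end
+  end
+end
+#+end_src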
+
+** Production Roll-out
+
+In this setup, we'll set the appropriate configuration for all HTTP
+frontend pods. This will cause all requests to pass through both
+policy engines and start generating trace data.
+
+The purpose of this stage is to start getting real production work
+passing through the permissions checks, without yet affecting the
+result of a request.
+
+The testing in this stage is just to see that the frontends are
+healthy, and that we're not immediately serving a spike of 403s. The
+rest of the data will come from the next stage of the test plan,
+Monitoring.
+
+Rolling back here is restarting the API with =POLICY_ENGINE= unset,
+which defaults to only using cancancan.
+
+** Monitoring
+
+The setup here is no different from the previous stage, but it is
+likely the bulk of the time, so I've separated it out. Here we'll be
+monitoring tracing data to look for differences in authorization
+decisions between the two engines.
+
+The main failure we expect here is that the policy results differ,
+which means either our definition of the equivalent Metal API roles in
+Permissions API needs to be updated, or, potentially, that the logic
+that does the Metal API check is broken.
+
+We can detect these mismatches, and I will create an HC dashboard to
+show authorization decisions that don't match, which can be due to the
+following reasons:
+- Policies are different
+- Computed the incorrect tenant resource
+- Couldn't resolve the tenant resource
+
+We can then address those issues on a case-by-case basis.
+
+We're also interested in the overall latency impact:
+- P95 runtime authorization check latency matches or is better than the
+  published Permissions API latency
+
+Completion criteria:
+- 100% accuracy on runtime checks that have been performed
+
+There's probably a better metric here for determining "completeness",
+but as a goal, driving discrepancies down toward 0 is a good indicator
+that we're ready to cut over completely.
+
+As we close the gap between the two models, we might decide that some
+error is tolerable.
+
+** Runtime is the source of truth
+
+Once we're here and we're happy with how the policy model is
+performing, we're ready to start using the runtime as the source of
+truth. This is just a configuration change to set =POLICY_ENGINE= to
+=prefer_runtime= or =runtime_only=.
+
+Prefer runtime uses the runtime check where possible. We still need to
+address the staff authorization model, so for that, we'll fall back to
+existing cancancan policies.
+
+At this point, customers authenticating with an exchanged token will
+be served responses based on the Permissions API policy.
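+
+To make the cut-over behavior concrete, a simplified sketch of how
+=prefer_runtime= could route a single check is below. The helper
+method names and the staff check are illustrative assumptions; the
+real fallback logic lives in the Metal API's policy engine selection.
+
+#+begin_src ruby
+# Illustrative sketch only: under prefer_runtime, staff requests (whose
+# authorization model hasn't been migrated yet) fall back to the
+# existing cancancan policies, while other requests are decided by the
+# IAM runtime / Permissions API. Helper names are hypothetical.
+def authorization_decision(subject, action, resource, mode: ENV["POLICY_ENGINE"])
+  case mode
+  when "runtime_only"
+    runtime_decision(subject, action, resource)
+  when "prefer_runtime"
+    if subject.staff?
+      cancancan_decision(subject, action, resource) # staff model not migrated yet
+    else
+      runtime_decision(subject, action, resource)   # Permissions API is authoritative
+    end
+  else
+    cancancan_decision(subject, action, resource)   # pre-cut-over default
+  end
+end
+#+end_src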