From 63fe9cf740d7b7f27c784366afb32c0234bc5f0c Mon Sep 17 00:00:00 2001
From: Adam Mohammed
Date: Wed, 18 Sep 2024 17:39:12 -0400
Subject: [PATCH] Test plan

---
 .../permissions-migration/test-plan.org | 133 ++++++++++++++++--
 1 file changed, 119 insertions(+), 14 deletions(-)

diff --git a/equinix/design/permissions-migration/test-plan.org b/equinix/design/permissions-migration/test-plan.org
index 7fa22e3..9661149 100644
--- a/equinix/design/permissions-migration/test-plan.org
+++ b/equinix/design/permissions-migration/test-plan.org
@@ -3,21 +3,126 @@
 * What's changed
 
+In the Metal API, there is now the ability to run different
+authorization policy engines. We have two engines: the cancancan
+engine and the Permissions API engine. We added the ability to run
+both of these during the span of a request while explicitly naming one
+as the source of truth for the ultimate authorization outcome.
+
+* What are we trying to get out of this test?
+
+We want to start sending authorization checks through Permissions API,
+but not break existing behavior. We need a way to validate that our
+permissions checks through the runtime behave as we expect.
+
+The first barrier to making sure we're not breaking production is to
+run the combined policies for all CI test cases. This proves, for the
+tested code paths, that we're at least able to serve requests.
+
+This test plan deals with validating the policy definition and
+integration with Permissions API in production.
+
 * Stages of testing
 
-- Initial Canary
-  - Run terraform against internal canary URL
-- Slow roll to production
-  - Watch for errors
-- In-production warn mode
-  - Observe for discrepancies between cancancan/iam-runtime
-- Runtime winning mode
-- Completed
-* Monitoring
-- Trace attributes that are relevant
+We'll roll this out in a few stages to be careful, since we're
+changing a fundamental piece of the architecture.
 
-- Dashboards
-- Create dashboard around cancancan disagreements
-- Create dashboard where resource was not metal org/project/user
+First, we'll run a smoke test suite against a canary which is separate
+from production traffic.
+Then, if the metrics look acceptable there, we'll roll this out to
+production, while keeping an eye specifically on latency and the
+number of 403s.
+Then, we'll monitor for discrepancies between the models and address
+them. This will be the bulk of the testing time, as we'll need a long
+enough duration to collect an accurate sample of the operations
+customers perform.
+Finally, we can move over to only using the runtime for authorization
+decisions.
 
-* Handling broken cases
+The next sections describe the test setup, what we'll monitor at each
+stage, the success criteria, and the rollback procedure.
+
+** Initial Canary
+
+In this setup, we'll have a separate ingress and deployment for the
+Metal API. This will allow us to exclusively route traffic to the
+backend configured to use the IAM runtime, while leaving production
+traffic using the cancancan policy only.
+
+The purpose of doing this is to try to find any hidden bugs that
+would cause an outage.
+
+We'll test this by running the terraform CI tests against the canary image.
+
+The success criteria for this step are:
+- CI test passes
+- CI test duration does not increase significantly compared to usual
+  runtimes (canary CI runtime <= 150% normal runtime)
+
+Rolling back here just involves cleaning up canary resources, and has
+no impact on customer experience.
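+
+For reference, the dual-engine evaluation described in "What's
+changed" (which the canary is the first environment to exercise) might
+look roughly like the sketch below. This is purely illustrative: the
+class name, engine wrappers, trace-attribute keys, and the default
+=POLICY_ENGINE= value are assumptions, not the actual Metal API
+implementation.
+
+#+begin_src ruby
+# Illustrative sketch only: run both engines for a single check, record
+# both outcomes on the current trace span, and return the decision from
+# whichever engine POLICY_ENGINE names as the source of truth.
+# Class, engine, and attribute names here are hypothetical.
+class DualPolicyCheck
+  def initialize(cancan_engine:, runtime_engine:, span:,
+                 mode: ENV.fetch("POLICY_ENGINE", "cancancan_only"))
+    @cancan_engine  = cancan_engine   # wraps the existing cancancan abilities
+    @runtime_engine = runtime_engine  # wraps the IAM runtime / Permissions API check
+    @span           = span            # current trace span (e.g. OpenTelemetry)
+    @mode           = mode
+  end
+
+  def allowed?(subject, action, resource)
+    cancan  = @cancan_engine.allowed?(subject, action, resource)
+    runtime = @runtime_engine.allowed?(subject, action, resource)
+
+    # Record both outcomes so disagreements show up in tracing.
+    @span.set_attribute("authz.cancancan.allowed", cancan)
+    @span.set_attribute("authz.runtime.allowed", runtime)
+    @span.set_attribute("authz.results_match", cancan == runtime)
+
+    case @mode
+    when "runtime_only"   then runtime
+    when "prefer_runtime" then runtime.nil? ? cancan : runtime # simplified fallback
+    else                       cancan # default: cancancan stays the source of truth
+    end
+  end
+end
+#+end_src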
+
+** Production Roll-out
+
+In this setup, we'll set the appropriate configuration for all HTTP
+frontend pods. This will cause all requests to pass through both
+policy engines and start generating trace data.
+
+The purpose of this stage is to start getting real production work
+passing through the permissions checks, without yet affecting the
+result of a request.
+
+The testing in this stage is just to see that the frontends are
+healthy, and that we're not immediately serving a spike of 403s. The
+rest of the data will come from the next stage of the test plan,
+Monitoring.
+
+Rolling back here is restarting the API with =POLICY_ENGINE= unset,
+which defaults to only using cancancan.
+
+** Monitoring
+
+The setup here is no different from the previous stage, but it is
+likely the bulk of the time, so I've separated it out. Here we'll be
+monitoring tracing data to look for differences in authorization
+decisions between the two engines.
+
+The main failure we expect here is that the policy results differ,
+which means either our definition of the equivalent Metal API roles in
+Permissions API needs to be updated, or, potentially, that the logic
+that does the Metal API check is broken.
+
+We can detect these mismatches, and I will create an HC dashboard to
+show authorization decisions that don't match, which can be due to the
+following reasons:
+- Policies are different
+- Computed the incorrect tenant resource
+- Couldn't resolve the tenant resource
+
+We can then address those issues on a case-by-case basis.
+
+We're also interested in the overall latency impact:
+- P95 runtime authorization check latency matches or is better than the
+  published Permissions API latency
+
+Completion criteria:
+- 100% accuracy on runtime checks that have been performed
+
+There's probably a better metric here for determining "completeness",
+but as a goal, driving discrepancies down toward 0 is a good indicator
+that we're ready to cut over completely.
+
+As we close the gap between the two models, we might decide that some
+error is tolerable.
+
+** Runtime is the source of truth
+
+Once we're here and we're happy with how the policy model is
+performing, we're ready to start using the runtime as the source of
+truth. This is just a configuration change to set =POLICY_ENGINE= to
+=prefer_runtime= or =runtime_only=.
+
+Prefer runtime uses the runtime check where possible. We still need to
+address the staff authorization model, so for that, we'll fall back to
+existing cancancan policies.
+
+At this point, customers authenticating with an exchanged token will
+be served responses based on the Permissions API policy.
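+
+To make the cut-over behavior concrete, a simplified sketch of how
+=prefer_runtime= could route a single check is below. The helper
+method names and the staff check are illustrative assumptions; the
+real fallback logic lives in the Metal API's policy engine selection.
+
+#+begin_src ruby
+# Illustrative sketch only: under prefer_runtime, staff requests (whose
+# authorization model hasn't been migrated yet) fall back to the
+# existing cancancan policies, while other requests are decided by the
+# IAM runtime / Permissions API. Helper names are hypothetical.
+def authorization_decision(subject, action, resource, mode: ENV["POLICY_ENGINE"])
+  case mode
+  when "runtime_only"
+    runtime_decision(subject, action, resource)
+  when "prefer_runtime"
+    if subject.staff?
+      cancancan_decision(subject, action, resource) # staff model not migrated yet
+    else
+      runtime_decision(subject, action, resource)   # Permissions API is authoritative
+    end
+  else
+    cancancan_decision(subject, action, resource)   # pre-cut-over default
+  end
+end
+#+end_src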