#+TITLE: Testing IAM-Runtime checks for Metal API
#+AUTHOR: Adam Mohammed
* What's changed
In the Metal API, there is now the ability to run different authorization
policy engines. We have two engines: the cancancan engine and the
Permissions API engine. We added the ability to run both during the span
of a request while explicitly naming one of them as the source of truth
for the ultimate authorization outcome.
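
To make the control flow concrete, here is a minimal sketch of what running
both engines with one named source of truth could look like. This is
illustrative only: the =Engine= interface, the type names, and the recording
hook are assumptions, not the actual Metal API implementation.

#+begin_src go
package authz

import "context"

// Engine is a hypothetical interface that both policy engines
// (cancancan and the Permissions API runtime) are assumed to satisfy.
type Engine interface {
    Allowed(ctx context.Context, subject, action, resource string) (bool, error)
}

// CombinedChecker runs both engines for every request, but only the
// engine named as the source of truth decides the outcome; the other
// result is recorded so the two can be compared later.
type CombinedChecker struct {
    Cancancan      Engine
    PermissionsAPI Engine
    PreferRuntime  bool // when true, the Permissions API result decides
    Record         func(cancanOK, runtimeOK bool)
}

func (c *CombinedChecker) Allowed(ctx context.Context, subject, action, resource string) (bool, error) {
    cancanOK, cancanErr := c.Cancancan.Allowed(ctx, subject, action, resource)
    runtimeOK, runtimeErr := c.PermissionsAPI.Allowed(ctx, subject, action, resource)

    // Record both decisions so discrepancies show up in trace data.
    c.Record(cancanOK, runtimeOK)

    if c.PreferRuntime {
        return runtimeOK, runtimeErr
    }
    return cancanOK, cancanErr
}
#+end_src
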
* What are we trying to get out of this test?
We want to start sending authorization checks through the Permissions API,
but not break existing behavior. We need a way to validate that our
permissions checks through the runtime behave as we expect.

The first barrier to making sure we're not breaking production is to
run the combined policies for all CI test cases. This proves, for the
tested code paths, that we're at least able to serve requests.

This test plan deals with validating the policy definition and
integration with the Permissions API in production.
* Stages of testing
There will be a few stages to rolling this out, to be careful, since we're
changing a fundamental piece of the architecture.

First, we'll run a smoke test suite against a canary which is separate
from production traffic.
Then, if the metrics look acceptable there, we'll roll this out to
production, while keeping an eye specifically on latency and the number
of 403s.
Then, we'll monitor for discrepancies between the models and address
them. This will be the bulk of the testing time, as we'll need a long
enough duration to get an accurate sample of the operations customers
perform.
Finally, we can move over to only using the runtime for authorization
decisions.

The next sections describe the test setup, what we'll monitor at each
stage, the success criteria, and the rollback procedure.
** Initial Canary
In this setup, we'll have a separate ingress and deployment for the
Metal API. This will allow us to exclusively route traffic to the
backend configured to use the IAM runtime, while leaving production
traffic using the cancancan policy only.

The purpose of doing this is to try to find any hidden bugs that
would cause an outage.

We'll test this by running the Terraform CI tests against the canary image.

The success criteria for this step are:
- CI test passes
- CI test duration does not increase significantly compared to usual
  runtimes (canary CI runtime <= 150% of the normal runtime; see the
  sketch below)
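
For the duration criterion, the comparison is simply the following (a
throwaway sketch; the package, function, and variable names are made up):

#+begin_src go
package citest

import "time"

// canaryWithinBudget reports whether the canary CI run stayed within
// 150% of the usual CI runtime, per the success criterion above.
func canaryWithinBudget(canary, usual time.Duration) bool {
    return canary <= usual*3/2
}
#+end_src
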
Rolling back here just involves cleaning up canary resources, and has
no impact on customer experience.
** Production Roll-out
In this setup, we'll set the appropriate configuration for all HTTP
frontend pods. This will cause all requests to pass through both
policy engines and start generating trace data.

The purpose of this stage is to start getting real production work
passing through the permissions checks, without yet affecting the
result of a request.

The testing in this stage is just to see that the frontends are
healthy, and that we're not immediately serving a spike of 403s. The
rest of the data will come from the next stage of the test plan,
Monitoring.

Rolling back here is restarting the API with =POLICY_ENGINE= unset,
which defaults to only using cancancan.
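
For illustration, the mapping from =POLICY_ENGINE= to behavior might look
roughly like this. The =combined= value and all names here are assumptions
for the sketch; the plan only commits to unset (cancancan only),
=prefer_runtime=, and =runtime_only=.

#+begin_src go
package authz

import "os"

// Mode describes which policy engine decides the outcome of a request.
type Mode int

const (
    CancancanOnly Mode = iota // default: existing behavior, used for rollback
    Combined                  // run both engines, cancancan still decides
    PreferRuntime             // run both, runtime decides where it can
    RuntimeOnly               // runtime decides everything
)

// modeFromEnv maps POLICY_ENGINE to a Mode. Unset or unrecognized
// values fall back to cancancan only, which is what the rollback
// described above relies on.
func modeFromEnv() Mode {
    switch os.Getenv("POLICY_ENGINE") {
    case "combined": // hypothetical value for the dual-engine rollout stage
        return Combined
    case "prefer_runtime":
        return PreferRuntime
    case "runtime_only":
        return RuntimeOnly
    default:
        return CancancanOnly
    }
}
#+end_src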
** Monitoring
The setup here is no different from the previous stage, but it is
likely the bulk of the time, so I've separated it out. Here we'll be
monitoring tracing data to look for differences in authorization
decisions between the two engines.

The main failure we expect here is that the policy results differ,
which means that either our definition of the equivalent Metal API roles
in the Permissions API needs to be updated or, potentially, that the
logic that does the Metal API check is broken.

To detect this, I will create an HC dashboard showing authorization
decisions that don't match, which can be due to the following reasons:
- Policies are different
- Computed an incorrect tenant resource
- Couldn't resolve the tenant resource

We can then address those issues on a case-by-case basis.
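
As a rough sketch of the data that dashboard would be built on, each
request could attach both decisions and a mismatch reason to its trace
span, assuming OpenTelemetry-style tracing; the attribute names here are
invented for illustration.

#+begin_src go
package authz

import (
    "context"

    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/trace"
)

// recordDecisions annotates the current span with both engines'
// results so mismatches can be charted and broken down by reason
// (policy difference, wrong tenant, unresolved tenant, and so on).
func recordDecisions(ctx context.Context, cancanOK, runtimeOK bool, reason string) {
    span := trace.SpanFromContext(ctx)
    span.SetAttributes(
        attribute.Bool("authz.cancancan.allowed", cancanOK),
        attribute.Bool("authz.runtime.allowed", runtimeOK),
        attribute.Bool("authz.decisions_match", cancanOK == runtimeOK),
        attribute.String("authz.mismatch_reason", reason),
    )
}
#+end_src
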
We're also interested in the overall latency impact:
- P95 runtime authorization check latency matches or is better than the
  published Permissions API latency

Completion criteria:
- 100% accuracy on the runtime checks that have been performed

There's probably a better metric here for determining "completeness",
but as a goal, driving discrepancies down toward 0 is a good indicator
that we're ready to cut over completely.

As we close the gap between the two models, we might decide that some
error is tolerable.
** Runtime is the source of truth
Once we're here and we're happy with how the policy model is performing,
we're ready to start using the runtime as the source of truth. This is
just a configuration change to set =POLICY_ENGINE= to =prefer_runtime= or
=runtime_only=.

=prefer_runtime= uses the runtime check where possible. We still need to
address the staff authorization model, so for that, we'll fall back to
the existing cancancan policies.
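
Continuing the earlier sketch, =prefer_runtime= behavior with the staff
fallback might look like the following. Here =runtimeCanAnswer= is a
hypothetical signal (for example, whether the request carries an exchanged
token), not something this plan defines.

#+begin_src go
package authz

import "context"

// preferRuntimeAllowed uses the runtime decision whenever the runtime
// can answer for the request, and falls back to the existing cancancan
// policies otherwise, e.g. for staff authorization, which is not yet
// modeled in the Permissions API.
func (c *CombinedChecker) preferRuntimeAllowed(ctx context.Context, subject, action, resource string, runtimeCanAnswer bool) (bool, error) {
    if runtimeCanAnswer {
        return c.PermissionsAPI.Allowed(ctx, subject, action, resource)
    }
    return c.Cancancan.Allowed(ctx, subject, action, resource)
}
#+end_src
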
At this point, customers authenticating with an exchanged token will
be served responses based on the Permissions API policy.