#+TITLE: Testing IAM-Runtime checks for Metal API
#+AUTHOR: Adam Mohammed
* What's changed
In the Metal API, there is now the ability to run different authorization
policy engines. We have two engines: the cancancan engine and the
Permissions API engine. We added the ability to run both during the span
of a request while explicitly naming one of them as the source of truth
for the ultimate authorization outcome.
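
To make the control flow concrete, here is a minimal sketch of what running
both engines with one named source of truth could look like. This is
illustrative only: the =Engine= interface, the type names, and the recording
hook are assumptions, not the actual Metal API implementation.

#+begin_src go
package authz

import "context"

// Engine is a hypothetical interface that both policy engines
// (cancancan and the Permissions API runtime) are assumed to satisfy.
type Engine interface {
    Allowed(ctx context.Context, subject, action, resource string) (bool, error)
}

// CombinedChecker runs both engines for every request, but only the
// engine named as the source of truth decides the outcome; the other
// result is recorded so the two can be compared later.
type CombinedChecker struct {
    Cancancan      Engine
    PermissionsAPI Engine
    PreferRuntime  bool // when true, the Permissions API result decides
    Record         func(cancanOK, runtimeOK bool)
}

func (c *CombinedChecker) Allowed(ctx context.Context, subject, action, resource string) (bool, error) {
    cancanOK, cancanErr := c.Cancancan.Allowed(ctx, subject, action, resource)
    runtimeOK, runtimeErr := c.PermissionsAPI.Allowed(ctx, subject, action, resource)

    // Record both decisions so discrepancies show up in trace data.
    c.Record(cancanOK, runtimeOK)

    if c.PreferRuntime {
        return runtimeOK, runtimeErr
    }
    return cancanOK, cancanErr
}
#+end_src
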
* What are we trying to get out of this test?
We want to start sending authorization checks through the Permissions API,
but not break existing behavior. We need a way to validate that our
permissions checks through the runtime behave as we expect.

The first barrier to making sure we're not breaking production is to
run the combined policies for all CI test cases. This proves, for the
tested code paths, that we're at least able to serve requests.

This test plan deals with validating the policy definition and
integration with the Permissions API in production.
* Stages of testing
There will be a few stages to rolling this out, to be careful, since we're
changing a fundamental piece of the architecture.

First, we'll run a smoke test suite against a canary which is separate
from production traffic.
Then, if the metrics look acceptable there, we'll roll this out to
production, while keeping an eye specifically on latency and the number
of 403s.
Then, we'll monitor for discrepancies between the models and address
them. This will be the bulk of the testing time, as we'll need a long
enough duration to get an accurate sample of the operations customers
perform.
Finally, we can move over to only using the runtime for authorization
decisions.

The next sections describe the test setup, what we'll monitor at each
stage, the success criteria, and the rollback procedure.
** Initial Canary
In this setup, we'll have a separate ingress and deployment for the
Metal API. This will allow us to exclusively route traffic to the
backend configured to use the IAM runtime, while leaving production
traffic using the cancancan policy only.

The purpose of doing this is to try to find any hidden bugs that
would cause an outage.

We'll test this by running the Terraform CI tests against the canary image.

The success criteria for this step are:
- CI test passes
- CI test duration does not increase significantly compared to usual
  runtimes (canary CI runtime <= 150% of the normal runtime; see the
  sketch below)
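
For the duration criterion, the comparison is simply the following (a
throwaway sketch; the package, function, and variable names are made up):

#+begin_src go
package citest

import "time"

// canaryWithinBudget reports whether the canary CI run stayed within
// 150% of the usual CI runtime, per the success criterion above.
func canaryWithinBudget(canary, usual time.Duration) bool {
    return canary <= usual*3/2
}
#+end_src
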
Rolling back here just involves cleaning up canary resources, and has
no impact on customer experience.
** Production Roll-out
In this setup, we'll set the appropriate configuration for all HTTP
frontend pods. This will cause all requests to pass through both
policy engines and start generating trace data.

The purpose of this stage is to start getting real production work
passing through the permissions checks, without yet affecting the
result of a request.

The testing in this stage is just to see that the frontends are
healthy, and that we're not immediately serving a spike of 403s. The
rest of the data will come from the next stage of the test plan,
Monitoring.

Rolling back here is restarting the API with =POLICY_ENGINE= unset,
which defaults to only using cancancan.
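
For illustration, the mapping from =POLICY_ENGINE= to behavior might look
roughly like this. The =combined= value and all names here are assumptions
for the sketch; the plan only commits to unset (cancancan only),
=prefer_runtime=, and =runtime_only=.

#+begin_src go
package authz

import "os"

// Mode describes which policy engine decides the outcome of a request.
type Mode int

const (
    CancancanOnly Mode = iota // default: existing behavior, used for rollback
    Combined                  // run both engines, cancancan still decides
    PreferRuntime             // run both, runtime decides where it can
    RuntimeOnly               // runtime decides everything
)

// modeFromEnv maps POLICY_ENGINE to a Mode. Unset or unrecognized
// values fall back to cancancan only, which is what the rollback
// described above relies on.
func modeFromEnv() Mode {
    switch os.Getenv("POLICY_ENGINE") {
    case "combined": // hypothetical value for the dual-engine rollout stage
        return Combined
    case "prefer_runtime":
        return PreferRuntime
    case "runtime_only":
        return RuntimeOnly
    default:
        return CancancanOnly
    }
}
#+end_src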
** Monitoring
The setup here is no different from the previous stage, but it is
likely the bulk of the time, so I've separated it out. Here we'll be
monitoring tracing data to look for differences in authorization
decisions between the two engines.

The main failure we expect here is that the policy results differ,
which means that either our definition of the equivalent Metal API roles
in the Permissions API needs to be updated or, potentially, that the
logic that does the Metal API check is broken.

To detect this, I will create an HC dashboard showing authorization
decisions that don't match, which can be due to the following reasons:
- Policies are different
- Computed an incorrect tenant resource
- Couldn't resolve the tenant resource

We can then address those issues on a case-by-case basis.
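
As a rough sketch of the data that dashboard would be built on, each
request could attach both decisions and a mismatch reason to its trace
span, assuming OpenTelemetry-style tracing; the attribute names here are
invented for illustration.

#+begin_src go
package authz

import (
    "context"

    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/trace"
)

// recordDecisions annotates the current span with both engines'
// results so mismatches can be charted and broken down by reason
// (policy difference, wrong tenant, unresolved tenant, and so on).
func recordDecisions(ctx context.Context, cancanOK, runtimeOK bool, reason string) {
    span := trace.SpanFromContext(ctx)
    span.SetAttributes(
        attribute.Bool("authz.cancancan.allowed", cancanOK),
        attribute.Bool("authz.runtime.allowed", runtimeOK),
        attribute.Bool("authz.decisions_match", cancanOK == runtimeOK),
        attribute.String("authz.mismatch_reason", reason),
    )
}
#+end_src
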
We're also interested in the overall latency impact:
- P95 runtime authorization check latency matches or is better than the
  published Permissions API latency

Completion criteria:
- 100% accuracy on the runtime checks that have been performed

There's probably a better metric here for determining "completeness",
but as a goal, driving discrepancies down toward 0 is a good indicator
that we're ready to cut over completely.

As we close the gap between the two models, we might decide that some
error is tolerable.
** Runtime is the source of truth
Once we're here and we're happy with how the policy model is performing,
we're ready to start using the runtime as the source of truth. This is
just a configuration change to set =POLICY_ENGINE= to =prefer_runtime= or
=runtime_only=.

=prefer_runtime= uses the runtime check where possible. We still need to
address the staff authorization model, so for that, we'll fall back to
the existing cancancan policies.
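
Continuing the earlier sketch, =prefer_runtime= behavior with the staff
fallback might look like the following. Here =runtimeCanAnswer= is a
hypothetical signal (for example, whether the request carries an exchanged
token), not something this plan defines.

#+begin_src go
package authz

import "context"

// preferRuntimeAllowed uses the runtime decision whenever the runtime
// can answer for the request, and falls back to the existing cancancan
// policies otherwise, e.g. for staff authorization, which is not yet
// modeled in the Permissions API.
func (c *CombinedChecker) preferRuntimeAllowed(ctx context.Context, subject, action, resource string, runtimeCanAnswer bool) (bool, error) {
    if runtimeCanAnswer {
        return c.PermissionsAPI.Allowed(ctx, subject, action, resource)
    }
    return c.Cancancan.Allowed(ctx, subject, action, resource)
}
#+end_src
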
At this point, customers authenticating with an exchanged token will
be served responses based on the Permissions API policy.