
Testing IAM-Runtime checks for Metal API

What's changed

In the Metal API, there's now the ability to run different authorization policy engines. We have two engines: the cancancan engine and the Permissions API engine. We added the ability to run both during the span of a request while explicitly naming one as the source of truth for the ultimate authorization outcome.
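To make the dual-engine flow concrete, here's a minimal sketch of how such a wrapper could behave, assuming hypothetical engine objects with an `allowed?` method; the real Metal API implementation likely differs:

```ruby
# Hypothetical sketch of the dual-engine check. Class and method names
# are illustrative, not the actual Metal API implementation.
class DualPolicyEngine
  def initialize(source_of_truth:, cancancan:, permissions_api:)
    @source_of_truth = source_of_truth # :cancancan or :permissions_api
    @engines = { cancancan: cancancan, permissions_api: permissions_api }
  end

  def authorize(actor, action, resource)
    results = @engines.transform_values do |engine|
      begin
        engine.allowed?(actor, action, resource)
      rescue StandardError
        # A failing secondary engine must not take down the request;
        # re-raise only when the source of truth itself fails.
        raise if engine.equal?(@engines[@source_of_truth])
        nil
      end
    end

    record_decision(results) # emit both outcomes as trace data
    results.fetch(@source_of_truth)
  end

  private

  # Placeholder: attach per-engine results to the current trace span.
  def record_decision(results); end
end
```

The key property is that a failure in the secondary engine never changes the outcome: only the engine named as the source of truth decides the request.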

What are we trying to get out of this test?

We want to start sending authorization checks through permissions API, but not break existing behavior. We need a way to validate that permissions checks made through the runtime behave as we expect.

The first barrier to making sure we're not breaking production is to run the combined policies for all CI test cases. This proves that, for the tested code paths, we're at least able to serve requests.

This test plan deals with validating the policy definition and integration with Permissions API in production.

Stages of testing

We'll roll this out in a few careful stages, since we're changing a fundamental piece of the architecture.

First, we'll run a smoke test suite against a canary which is separate from production traffic. Then, if the metrics look acceptable there, we'll roll this out to production, keeping an eye specifically on latency and the number of 403s. Then, we'll monitor for discrepancies between the models and address them. This will be the bulk of the testing time, as we'll need a long enough duration to collect an accurate sample of the operations customers perform. Finally, we can move over to only using the runtime for authorization decisions.

The next sections describe the test setup, what we'll monitor at each stage, the success criteria, and the rollback procedure.

Initial Canary

In this setup, we'll have a separate ingress and deployment for the Metal API. This will allow us to route traffic exclusively to the backend configured to use the IAM runtime, while leaving production traffic using the cancancan policy only.

The purpose of doing this is to try and find any hidden bugs that would cause an outage.

We'll test this by running the terraform CI tests against the canary image.

The success criteria for this step are:

  • CI tests pass
  • CI test duration does not increase significantly compared to usual runs (canary CI runtime <= 150% of normal runtime)

Typical CI tests use API keys instead of an Identity API JWT, which would be necessary for a permissions API check, so I'll need to modify terraform to pull the credentials appropriately.

Rolling back here just involves cleaning up canary resources, and has no impact on customer experience.

Production Roll-out

In this setup, we'll set the appropriate configuration for all HTTP frontend pods. This will cause all requests to pass through both policy engines and start generating trace data.

The purpose of this stage is to start getting real production work passing through the permissions checks, without yet affecting the result of any request.

The testing in this stage is just to verify that the frontends are healthy and that we aren't immediately serving a spike of 403s. The rest of the data will come from the next stage of the test plan, Monitoring.

Rolling back here means restarting the API with `POLICY_ENGINE` unset, which defaults to only using cancancan.
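As a sketch of why an unset variable is a safe rollback, the boot-time selection could look something like this. The `prefer_runtime` and `runtime_only` modes appear later in this plan; the other mode names and the validation logic are assumptions:

```ruby
# Hypothetical boot-time engine selection; mode names other than
# prefer_runtime and runtime_only are illustrative assumptions.
VALID_MODES = %w[cancancan_only combined prefer_runtime runtime_only].freeze

def policy_engine_mode
  # An unset POLICY_ENGINE falls back to cancancan-only, which is why
  # a restart with the variable unset is a clean rollback.
  mode = ENV.fetch("POLICY_ENGINE", "cancancan_only")
  raise ArgumentError, "unknown POLICY_ENGINE: #{mode.inspect}" unless VALID_MODES.include?(mode)
  mode
end
```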

Monitoring

The setup here is no different than the previous stage, but it will likely take the bulk of the testing time, so I've separated it out. Here we'll be monitoring tracing data to look for differences in authorization decisions between the two engines.

The main failure we expect here is that the policy results differ, which means either that our definition of the equivalent Metal API roles in Permissions API needs to be updated or, potentially, that the logic performing the Metal API check is broken.

To detect these, I will create an HC dashboard showing authorization decisions that don't match, which can be due to the following reasons:

  • Policies are different
  • Computed incorrect tenant resource
  • Couldn't resolve the tenant resource

We can then address those issues on a case-by-case basis.
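As a sketch of what could feed that dashboard, each request might attach both engines' decisions and a mismatch reason to the active trace span. This assumes an OpenTelemetry-style tracer; the attribute names are hypothetical, not the Metal API's actual schema:

```ruby
require "opentelemetry/sdk"

# Hypothetical trace annotation for the dual-engine comparison;
# attribute names are illustrative only.
def record_decision(cancancan_allowed, runtime_allowed)
  span = OpenTelemetry::Trace.current_span
  span.set_attribute("authz.cancancan.allowed", cancancan_allowed)

  if runtime_allowed.nil?
    # The runtime couldn't resolve the tenant resource for this request.
    span.set_attribute("authz.mismatch_reason", "tenant_unresolved")
  else
    span.set_attribute("authz.runtime.allowed", runtime_allowed)
    span.set_attribute("authz.engines_match", cancancan_allowed == runtime_allowed)
    if cancancan_allowed != runtime_allowed
      span.set_attribute("authz.mismatch_reason", "policy_difference")
    end
  end
end
```

Grouping on an attribute like authz.mismatch_reason would then separate policy differences from tenant-resolution failures in the dashboard.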

We're also interested in the overall latency impact:

  • P95 runtime authorization check latency matches or is better than the published Permissions API latency

Completion criteria:

  • 100% accuracy on runtime checks that have been performed

There's probably a better metric here for determining "completeness", but as a goal, driving discrepancies down toward 0 is a good indicator that we're ready to cut over completely.

As we close the gap between the two models, we might decide that some error is tolerable.
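To make "tolerable error" concrete, a discrepancy-rate gate could decide readiness to cut over; both the metric and the threshold below are illustrative, not something this plan commits to:

```ruby
# Hypothetical cut-over gate. The 0.01% threshold is illustrative only.
TOLERABLE_DISCREPANCY_RATE = 0.0001

def ready_to_cut_over?(mismatched_checks, total_checks)
  return false if total_checks.zero? # no data yet, don't cut over
  (mismatched_checks.to_f / total_checks) <= TOLERABLE_DISCREPANCY_RATE
end
```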

Runtime is the source of truth

Once we're here and we're happy with how the policy model is performing, we're ready to start using the runtime as the source of truth. This is just a configuration change to set `POLICY_ENGINE` to `prefer_runtime` or `runtime_only`.

Prefer runtime uses the runtime check where possible. We still need to address the staff authorization model, so for that, we'll fall back to the existing cancancan policies.
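A minimal sketch of the prefer_runtime dispatch, assuming a hypothetical predicate for whether the runtime can evaluate a given actor (staff requests, for example, would not be runtime-evaluable yet):

```ruby
# Hypothetical prefer_runtime dispatch; method names are illustrative.
def authorize(actor, action, resource)
  if runtime_evaluable?(actor)
    # The runtime decision is authoritative for supported actors.
    runtime_engine.allowed?(actor, action, resource)
  else
    # Staff, and anything else the runtime can't evaluate yet,
    # falls back to the existing cancancan policies.
    cancancan_engine.allowed?(actor, action, resource)
  end
end
```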

At this point, customers authenticating with an exchanged token will be served responses based on the permissions API policy.