From 1da31679cb0e991d4af42930036bd18d637700bd Mon Sep 17 00:00:00 2001
From: Adam Mohammed
Date: Tue, 30 Jul 2024 08:55:11 -0400
Subject: [PATCH] Add the rambling on capability systems

---
 equinix/design/capability-systems.org | 152 ++++++++++++++++++++++++++
 1 file changed, 152 insertions(+)
 create mode 100644 equinix/design/capability-systems.org

diff --git a/equinix/design/capability-systems.org b/equinix/design/capability-systems.org
new file mode 100644
index 0000000..b7f6712
--- /dev/null
+++ b/equinix/design/capability-systems.org
@@ -0,0 +1,152 @@
+* Bootstrapping trust in a capability model
+
+There are two basic ways to start the chain of trust with a capability
+model: either the resource server is started with a set of root
+capabilities that govern all the resources, or ambient authority is
+used to provide the initial trust.
+
+Let's take the IP example further: some IPAM service is supposed to
+govern the RFC1918 space for Equinix. It provides an API for
+downstream services to request blocks of arbitrary size, so they can
+further allocate smaller blocks from those blocks.
+
+I think the easiest way is just to use ACLs for the initial set of
+capabilities, and once the service is live, the majority of requests
+would be using wrapped resources. Let's say this IPAM service allows
+creation of "root" ranges through a create range API.
+An operator could create the range for 10.0.0.0/8, and then create a
+wrapped resource to delegate to downstream services.
+
+If MCN, Metal and Fabric are all interested in sharing this IP space,
+we could have each service request an IP range of a specific size. Then
+the operator could create wrapped resources for larger ranges for each
+of the business units, and then hand those to the operators for the
+MCN/Metal and Fabric services.
+
+Once the dependent service gets their wrapped resource, they can
+further divide the resources if they have multiple services that want
+to allocate from distinct pools within that space, or they can all
+share the capability as-is.
+
+The dependent service could then make direct calls to the IPAM service
+to make "assignments", which mark that an IP is currently in use
+within the larger range.
+
+Eventually, we want to get away from this operator X does operation
+for operator Y, because it means that
+
+
+Let's assume we made an IPAM service that has the following endpoints:
+
+- CREATE IP Range
+  Adds an entry to allow the IPAM service to govern the range
+  Returns a resource ID
+- LIST IP Ranges
+  Lists all the ranges governed by the IPAM service
+- GET IP Range
+  Shows details about the IP range, such as how much of the range is
+  allocated.
+
+  Can be accessed either by ACL or by capability
+
+- DELETE IP Range
+  Removes an IP range from being governed by the IPAM service
+
+- CREATE IP Range Request
+  Requests a capability which lets a service allocate from this IP Range
+
+- GET/LIST IP Range Request
+  Shows the status of a request
+
+- PUT IP Range Request
+  Allows approving/denying the request
+
+- DELETE IP Range Request
+  Removes an IP range request
+
+- CREATE IP Assignment
+  Only accepts a wrapped resource, marks IP Address or subnet as allocated.
+
+
+Now we consider how we get to the point where we can start using
+capabilities. Initially, an operator needs to start the service by
+creating some IP ranges that the IPAM service is responsible for. This
+endpoint can use ACLs to check that the operator has the authorization
+to create ranges, and then the service can allow requests.
+
+Next, some service, like the Metal Provisioner, needs to assign IPs to
+instances so they can talk to each other over the private
+network. Initially the provisioner doesn't have access to any IP
+ranges, so it sends a request for a /16.
+That /16 request is then approved by an IPAM operator, and the provisioner
+receives a capability that allows manipulating assignments on that range.
+
+
+The IPAM operator portion could be removed
+
+
+
+----
+
+* IPAM Worked Example
+
+Let's assume we have an IPAM system which governs 10.0.0.0/8 and
+other IP blocks. We have a service, such as LBaaS, which needs to
+assign private IPs to customer load balancer instances. The LBaaS
+service needs to assign unique IPs to the load balancer instances so
+that customers can route traffic to their metal instances.
+
+The LB service needs to reach out to the IPAM service to pull an IP,
+and to do that, it must request it within a block represented by a
+wrapped resource. So how does the service initially obtain this
+wrapped resource?
+
+On first startup, the LBaaS service knows it doesn't have the
+capability to assign IPs because it doesn't have a wrapped resource
+for the range. It reaches out, authenticated as itself, to the IPAM
+service and requests a =/16=. That request is authorized just by the
+fact that the LB service has the correct audience to talk to the IPAM
+service.
+
+The request is recorded, and some approval process is done by the IPAM
+operators, or is determined by business logic. Once approved, the
+wrapped resource for the requested range is issued to the LBaaS
+service, which it stores. Now, whenever an IP is needed, it makes an
+assignment under that wrapped resource.
+
+Internally, the IPAM service needs to record that a block is currently
+active, and that the capability sent to the LB service references
+it. As an example, let's say 10.0.0.0/8 is represented by the root
+resource identifier `ntwkblk-a1b2c3`.
+When the LB service requests a =/16=, a new IP reservation resource is
+created, `ntwkipr-xyzxyz`, and once approved, a capability is created by
+calling WrapResource(ntwkipr-xyzxyz, [create_assignment, read_assignment,
+delete_assignment], {}), which produces a wrapped resource with ID
+`ntwkipr-u8e82i.qeoalf`. The IPAM service distributes this back to
+the LB service.
+
+When the LB service wishes to record an assignment to that block, it
+can make a request to the IPAM service's assignment endpoint
+(e.g. POST /ip-reservations/ntwkipr-u8e82i.qeoalf/assignments). From
+there, the IPAM service calls UnwrapResource(ntwkipr-u8e82i.qeoalf,
+[create_assignment], {}), which succeeds because the wrapped resource
+is valid, the verifier matches, and the operation is allowed for that
+ID, and the assignment is created.
+
+This example describes a manual approval process and doesn't
+necessarily describe how the async process is implemented for yielding
+the capability back to the requesting service. The manual approval
+process could easily be replaced by setting limits per identity, and
+requiring manual approval for higher limits, e.g. any product can
+request up to a /24, but if you want anything larger, you'll need
+manual approval by the governing team. In that case, the system
+becomes more dynamic and teams can self-serve their requests. The
+distribution of the capability must happen over a secure channel as
+well, such as a NATS topic that only the requesting service has access
+to, or by direct callback API.
+
+Further delegation is possible as well, where the LB service could ask
+the IPAM service to wrap `ntwkipr-u8e82i.qeoalf` another time, but
+this time only to perform `read_assignment` and then the LB team can
+create operator tools to find details about the assignment from the
+IPAM service without having the ability to do damage.
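
To make the wrap/unwrap flow in the worked example a bit more concrete, here is a minimal Python sketch. It is only an illustration under assumptions, not how the IPAM service is implemented: the note never says how the verifier is derived or where capabilities are stored, so the in-memory `CAPABILITIES` table, the `SECRET` key, the truncated HMAC verifier, and the function names `wrap_resource` / `unwrap_resource` are all hypothetical. The third argument (the `{}` in the calls above) is treated as an opaque set of caveats and is stored but not evaluated.

```python
import hashlib
import hmac
import secrets

# Hypothetical server-side state. The note does not specify how the IPAM
# service stores capabilities or derives the verifier.
SECRET = secrets.token_bytes(32)
CAPABILITIES = {}  # wrapped ID (without verifier) -> capability record


def _verifier(wrapped_id):
    """Assumed scheme: a truncated HMAC-SHA256 over the wrapped ID."""
    digest = hmac.new(SECRET, wrapped_id.encode(), hashlib.sha256)
    return digest.hexdigest()[:12]


def wrap_resource(resource_id, operations, caveats):
    """Issue a capability that grants `operations` on `resource_id`."""
    prefix = resource_id.split("-", 1)[0]
    wrapped_id = f"{prefix}-{secrets.token_hex(3)}"
    CAPABILITIES[wrapped_id] = {
        "resource": resource_id,
        "operations": frozenset(operations),
        "caveats": dict(caveats),  # stored only; not evaluated in this sketch
    }
    return f"{wrapped_id}.{_verifier(wrapped_id)}"


def unwrap_resource(wrapped, operations, context):
    """Return the underlying resource ID, or raise PermissionError.

    `context` is accepted to mirror the three-argument call in the text,
    but this sketch does not use it.
    """
    wrapped_id, _, verifier = wrapped.partition(".")
    record = CAPABILITIES.get(wrapped_id)
    expected = _verifier(wrapped_id)
    if record is None or not hmac.compare_digest(
        verifier.encode(), expected.encode()
    ):
        raise PermissionError("unknown or tampered capability")
    if not set(operations) <= record["operations"]:
        raise PermissionError("operation not allowed for this capability")
    return record["resource"]


# The IPAM service approves the /16 request and issues the capability.
full = wrap_resource(
    "ntwkipr-xyzxyz",
    ["create_assignment", "read_assignment", "delete_assignment"],
    {},
)

# Later, the assignment endpoint checks the capability before acting.
reservation = unwrap_resource(full, ["create_assignment"], {})

# Further delegation: re-wrap the same reservation with fewer operations.
read_only = wrap_resource(reservation, ["read_assignment"], {})
```

With this shape, the read-only capability handed to operator tooling resolves to the same reservation but fails the check for `create_assignment` or `delete_assignment`, which is the "no ability to do damage" property described above. A real service would also need persistence, expiry, and revocation, none of which are covered here.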