Add the rambling on capability systems

2024-07-30 08:55:11 -04:00
parent 1ddc9f19f1
commit 1da31679cb
1 changed files with 152 additions and 0 deletions
--- a/equinix/design/capability-systems.org
+++ b/equinix/design/capability-systems.org
@@ -0,0 +1,152 @@
 * Bootstrapping trust in a capability model
 There are two basic ways to start the chain of trust with a capability
 model: either the resource server is started with a set of root
 capabilities that govern all the resources, or ambient authority is
 used to provide the initial trust.
 Let's take the IP example further. Some IPAM service is supposed to
 govern the RFC1918 space for Equinix. It provides an API for
 downstream services to request blocks of arbitrary size, so they can
 further allocate smaller blocks from those blocks.
 I think the easiest way is just to use ACLs for the initial set of
 capabilities, and once the service is live, the majority of requests
 would be using wrapped resources. Let's say this IPAM service allows
 creation of "root" ranges through a create range API.
 An operator could create the range for 10.0.0.0/8. And then create a
 wrapped resource to delegate to downstream services.
 If MCN, Metal and Fabric are all interested in sharing this IP space,
 we could have each service request an IP range of a specific size. Then
 the operator could create wrapped resources for larger ranges for each
 of the business units, and then hand those to the operators for the
 MCN, Metal and Fabric services.
 Once the dependent service gets their wrapped resource, they can
 further divide the resources if they have multiple services that want
 to allocate from distinct pools within that space, or they can all
 share the capability as-is.
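 The subdivision described above can be sketched with Python's standard
 =ipaddress= module. This is only an illustration of the address
 arithmetic: the /10-per-business-unit split and the /16 pool size are
 assumptions chosen for the example, and no capability handling is shown.

```python
import ipaddress

# The root range an operator registered with the IPAM service.
root = ipaddress.ip_network("10.0.0.0/8")

# Hypothetical split: one /10 per business unit (the fourth /10 stays unassigned).
business_units = ["MCN", "Metal", "Fabric"]
delegated = dict(zip(business_units, root.subnets(new_prefix=10)))

# A business unit can carve its block into distinct pools for its own services.
metal_pools = list(delegated["Metal"].subnets(new_prefix=16))

for unit, block in delegated.items():
    print(unit, block)
print("Metal pools:", len(metal_pools), "first:", metal_pools[0])
```

 Each delegated block would be handed out as its own wrapped resource,
 so a holder can only ever allocate inside the block it was given.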
 The dependent service could then make direct calls to the IPAM service
 to make "assignments", which mark an IP as currently in use within the
 larger range.
 Eventually, we want to get away from this operator X does operation
 for operator Y, because it means that
 Let's assume we made an IPAM service that has the following endpoints:
 - CREATE IP Range
  Adds an entry to allow the IPAM service to govern the range
  Returns a resource ID
 - LIST IP Ranges
  Lists all the ranges governed by the IPAM service
 - GET IP RANGE
  Shows details about the IP range, such as how much of the range is
  allocated.
  Can be accessed either by ACL or by capability
 - DELETE IP Range
  Remove an IP range from being governed by the IPAM service
 - CREATE IP Range Request
  Request a capability which lets a service allocate from this IP Range
 - GET/LIST IP Range Request
  Show the status of a request
 - PUT IP Range Request
  Allows approving/denying the request
 - DELETE IP Range Request
  Removing an IP range request
 - CREATE IP Assignment
  Only accepts a wrapped resource, marks IP Address or subnet as allocated.
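 To make the range and assignment endpoints above more concrete, here is
 a minimal in-memory sketch. The class name =RangeStore=, its method
 names, and the generated ID format are hypothetical and not part of any
 existing service. ACL and capability checks are deliberately left out;
 in the design above, =create_assignment= would only be reachable through
 a wrapped resource.

```python
import ipaddress
import itertools


class RangeStore:
    """In-memory stand-in for the range and assignment endpoints (illustrative only)."""

    def __init__(self):
        self._ranges = {}      # resource ID -> ip_network governed by the service
        self._assigned = {}    # resource ID -> set of assigned ip_network objects
        self._ids = itertools.count(1)

    def create_range(self, cidr):
        """CREATE IP Range: start governing a range and return its resource ID."""
        rid = f"ntwkblk-{next(self._ids):06d}"
        self._ranges[rid] = ipaddress.ip_network(cidr)
        self._assigned[rid] = set()
        return rid

    def list_ranges(self):
        """LIST IP Ranges: all governed resource IDs."""
        return sorted(self._ranges)

    def get_range(self, rid):
        """GET IP Range: details, including how much of the range is allocated."""
        net = self._ranges[rid]
        used = sum(a.num_addresses for a in self._assigned[rid])
        return {"id": rid, "cidr": str(net), "allocated": used,
                "size": net.num_addresses}

    def delete_range(self, rid):
        """DELETE IP Range: stop governing the range."""
        del self._ranges[rid]
        del self._assigned[rid]

    def create_assignment(self, rid, cidr):
        """CREATE IP Assignment: mark an address or subnet as allocated."""
        net = ipaddress.ip_network(cidr)
        if not net.subnet_of(self._ranges[rid]):
            raise ValueError(f"{net} is outside {self._ranges[rid]}")
        if any(net.overlaps(a) for a in self._assigned[rid]):
            raise ValueError(f"{net} overlaps an existing assignment")
        self._assigned[rid].add(net)
        return str(net)
```

 A single address such as =10.1.2.3= is treated as a /32, so the same
 method covers both the "IP Address" and "subnet" cases.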
 Now consider how we get to the point of being able to use
 capabilities. Initially, an operator needs to start the service by
 creating some IP ranges that the IPAM service is responsible for. This
 endpoint can use ACLs to check that the operator has the authorization
 to create ranges, and then the service can allow requests.
 Next, some service, like the Metal Provisioner, needs to assign IPs to
 instances so they can talk to each other over the private
 network. Initially the provisioner doesn't have access to any IP
 ranges, so it sends a request for a /16. That /16 request is then
 approved by an IPAM operator, and the provisioner receives a
 capability that allows manipulating assignments on that range.
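 The request and approval flow described here can be modelled as a small
 state machine. The following is a sketch under assumptions: the names
 =RangeRequest=, =Status= and =decide= are hypothetical, and the =issue=
 callback stands in for whatever actually creates the capability.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    DENIED = "denied"


@dataclass
class RangeRequest:
    requester: str                      # service identity, e.g. the provisioner
    prefix_len: int                     # 16 for a /16
    status: Status = Status.PENDING
    capability: Optional[str] = None    # set only once the request is approved


def decide(request: RangeRequest, approve: bool,
           issue: Callable[[RangeRequest], str]) -> RangeRequest:
    """Approve or deny a pending request (the PUT IP Range Request step)."""
    if request.status is not Status.PENDING:
        raise ValueError("request was already decided")
    if approve:
        request.status = Status.APPROVED
        request.capability = issue(request)   # capability exists only after approval
    else:
        request.status = Status.DENIED
    return request
```

 The point of the sketch is that a capability never exists for a request
 that is still pending or was denied, and that a decision is final.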
 The IPAM operator portion could be removed
 ----
 IPAM Worked Example
 Let's assume we have an IPAM system which governs 10.0.0.0/8, and
 other IP blocks. We have a service, such as LBaaS which needs to
 assign Private IPs to customer Load balancer instances. The LBaaS
 service needs to assign unique IPs to the load balancer instances so
 that customers can route traffic to their metal instances.
 The LB service needs to reach out to the IPAM service to pull an IP,
 and to do that, it must request it within a block represented by a
 wrapped resource. So how does the service initially obtain this
 wrapped resource?
 On first startup, the LBaaS service knows it doesn't have the
 capability to assign IPs because it doesn't have a wrapped resource
 for the range. It reaches out authenticated as itself to the IPAM
 service, and requests a =/16=. That request is authorized just by the
 fact that the LB service has the correct audience to talk to the IPAM
 service.
 The request is recorded, and some approval process is done by the IPAM
 operators, or is determined by business logic. Once approved, the
 wrapped resource for the requested range is issued to the LBaaS
 service, which it stores. Now, whenever an IP is needed, it makes an
 assignment under that wrapped resource.
 Internally, the IPAM service needs to record that a block is currently
 active, and that the capability sent to the LB service references
 it. As an example, let's say the 10.0.0.0/8 is represented by the root
 resource identifier `ntwkblk-a1b2c3`. When the LB service requests a
 =/16=, a new IP reservation resource is created, `ntwkipr-xyzxyz`, and
 once approved, a capability is created by calling
 WrapResource(ntwkipr-xyzxyz, [create_assignment, read_assignment, delete_assignment],
 {}), which produces a wrapped resource with ID
 `ntwkipr-u8e82i.qeoalf` and the IPAM service distributes this back to
 the LB service.
 When the LB service wishes to record an assignment to that block, it
 can make a request to the IPAM service's assignment endpoint
 (e.g. POST /ip-reservations/ntwkipr-u8e82i.qeoalf/assignments). From
 there, the IPAM service calls UnwrapResource(ntwkipr-u8e82i.qeoalf,
 [create_assignment], {}), which succeeds because the wrapped resource
 is valid, the verifier matches, and the operation is allowed for that
 ID, and the assignment is created.
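 One possible shape for the wrap and unwrap calls is sketched below. This
 is not the actual WrapResource/UnwrapResource implementation, which this
 note does not specify. It assumes the wrapped ID has the form
 =<public id>.<verifier>=, that the service stores only a digest of the
 verifier, and that the third argument (shown as ={}= above) is extra
 context, which this sketch stores but does not evaluate.

```python
import hashlib
import hmac
import secrets

# Public part of a wrapped ID -> (resource ID, allowed operations, verifier digest, context)
_wrapped = {}


def wrap_resource(resource_id, operations, context):
    """Issue a wrapped ID that grants `operations` on `resource_id`."""
    prefix = resource_id.split("-", 1)[0]
    public = f"{prefix}-{secrets.token_hex(3)}"
    verifier = secrets.token_urlsafe(16)   # secret half, known only to the holder
    digest = hashlib.sha256(verifier.encode()).hexdigest()
    _wrapped[public] = (resource_id, frozenset(operations), digest, dict(context))
    return f"{public}.{verifier}"


def unwrap_resource(wrapped_id, operations, context):
    """Return the underlying resource ID if the wrapped ID permits `operations`."""
    public, _, verifier = wrapped_id.partition(".")
    entry = _wrapped.get(public)
    if entry is None:
        raise PermissionError("unknown wrapped resource")
    resource_id, allowed, digest, _stored_context = entry
    presented = hashlib.sha256(verifier.encode()).hexdigest()
    if not hmac.compare_digest(presented, digest):
        raise PermissionError("verifier does not match")
    if not set(operations) <= allowed:
        raise PermissionError("operation not allowed for this wrapped resource")
    return resource_id
```

 The three failure branches correspond to the three conditions named in
 the text: the wrapped resource is valid, the verifier matches, and the
 operation is allowed for that ID.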
 This example describes a manual approval process and doesn't
 necessarily describe how the async process is implemented for yielding
 the capability back to the requesting service. The manual approval
 process could easily be replaced by setting limits per identity, and
 requiring manual approval for higher limits, e.g. any product can
 request up to a /24, but if you want anything larger, you'll need
 manual approval by the governing team. In that case, the system
 becomes more dynamic and teams can self-serve their requests. The
 distribution of the capability must happen over a secure channel as
 well, such as a NATS topic that only the requesting service has access
 to, or by direct callback API.
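 The per-identity limit idea could look roughly like the following. The
 /24 default comes from the example above; the per-identity override
 table and the function name =review_mode= are assumptions for
 illustration. Note that a larger block means a smaller prefix length.

```python
# Largest block any identity may self-serve, expressed as a prefix length.
DEFAULT_SELF_SERVE_PREFIX = 24

# Hypothetical overrides for identities that were granted a higher limit.
PER_IDENTITY_PREFIX = {"lbaas": 20}


def review_mode(identity, requested_prefix_len):
    """Return "auto" if the request is within the identity's limit, else "manual"."""
    limit = PER_IDENTITY_PREFIX.get(identity, DEFAULT_SELF_SERVE_PREFIX)
    # A /28 is smaller than a /24, so a numerically larger prefix is within the limit.
    if requested_prefix_len >= limit:
        return "auto"
    return "manual"
```

 With this policy, a request for a /24 or smaller is approved
 automatically, while a /16 falls through to the governing team.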
 Further delegation is possible as well, where the LB service could ask
 the IPAM service to wrap `ntwkipr-u8e82i.qeoalf` another time, but
 this time only to perform `read_assignment` and then the LB team can
 create operator tools to find details about the assignment from the
 IPAM service without having the ability to do damage.
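 The key property of this kind of delegation is that a re-wrapped
 capability can only narrow what its parent allowed. A minimal sketch of
 that check, with hypothetical names (=Capability=, =delegate=) and
 without the verifier handling:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capability:
    resource_id: str
    operations: frozenset


def delegate(parent, operations):
    """Derive a narrower capability; refuse anything the parent does not hold."""
    requested = frozenset(operations)
    extra = requested - parent.operations
    if extra:
        raise PermissionError(f"cannot delegate operations not held: {sorted(extra)}")
    return Capability(parent.resource_id, requested)


# The LB service's capability, and a read-only one derived for operator tools.
lb_cap = Capability("ntwkipr-xyzxyz", frozenset(
    {"create_assignment", "read_assignment", "delete_assignment"}))
tools_cap = delegate(lb_cap, ["read_assignment"])
```

 Because =delegate= rejects any operation outside the parent's set, the
 operator tools cannot obtain create or delete rights by asking for a new
 wrap.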