Add the rambling on capability systems

2024-07-30 08:55:11 -04:00
parent 1ddc9f19f1
commit 1da31679cb
1 changed files with 152 additions and 0 deletions
--- a/equinix/design/capability-systems.org
+++ b/equinix/design/capability-systems.org
@@ -0,0 +1,152 @@
+* Bootstrapping trust in a capability model
+
+There are two basic ways to start the chain of trust in a capability
+model: either the resource server is started with a set of root
+capabilities that govern all the resources, or ambient authority is
+used to provide the initial trust.
+
+Let's take the IP example further. Some IPAM service is supposed to
+govern the RFC1918 space for Equinix. It provides an API for
+downstream services to request blocks of arbitrary size, so they can
+further allocate smaller blocks from those blocks.
+
+I think the easiest way is to use ACLs for the initial set of
+capabilities; once the service is live, the majority of requests
+would use wrapped resources. Let's say this IPAM service allows
+creation of "root" ranges through a create range API. An operator
+could create the range for 10.0.0.0/8, and then create a wrapped
+resource to delegate to downstream services.
+
+If MCN, Metal and Fabric are all interested in sharing this IP space,
+we could have the service request an IP range of a specific size. Then
+the operator could create wrapped resources for larger ranges for each
+of the business units, and then hand those to the operators for the
+MCN, Metal and Fabric services.
+
+Once the dependent service gets their wrapped resource, they can
+further divide the resources if they have multiple services that want
+to allocate from distinct pools within that space, or they can all
+share the capability as-is.
+
+The dependent service could then make direct calls to the IPAM service
+to make "assignments" that mark an IP as currently in use within the
+larger range.
+
+Eventually, we want to get away from this operator X does operation
+for operator Y, because it means that
+
+
+Let's assume we made an IPAM service that has the following endpoints:
+
+- CREATE IP Range
+  Adds an entry to allow the IPAM service to govern the range
+  Returns a resource ID
+- LIST IP Ranges
+  Lists all the ranges governed by the IPAM service
+- GET IP RANGE
+  Shows details about the IP range, such as how much of the range is
+  allocated.
+
+  Can be accessed either by ACL or by capability
+
+- DELETE IP Range
+  Removes an IP range from being governed by the IPAM service
+
+- CREATE IP Range Request
+  Request a capability which lets a service allocate from this IP Range
+
+- GET/LIST IP Range Request
+  Show the status of a request
+
+- PUT IP Range Request
+  Allows approving/denying the request
+
+- DELETE IP Range Request
+  Removes an IP range request
+
+- CREATE IP Assignment
+  Only accepts a wrapped resource, marks IP Address or subnet as allocated.
+
+
+Now consider how we get to the point of using capabilities.
+Initially, an operator needs to start the service by creating some IP
+ranges that the IPAM service is responsible for. This endpoint can use
+ACLs to check that the operator is authorized to create ranges, and
+then the service can allow requests.
+
+Next, some service, like the Metal Provisioner, needs to assign IPs to
+instances so they can talk to each other over the private
+network. Initially the provisioner doesn't have access to any IP
+ranges, so it sends a request for a /16. That /16 request is then
+approved by an IPAM operator, and the provisioner receives a
+capability that allows manipulating assignments on that range.
+
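+The request and approval flow above could be sketched roughly as
+follows. This is a minimal in-memory illustration, not the service's
+design: the names =RangeRequest=, =Ipam=, =create_request= and
+=approve=, the status strings, and the first-fit choice of block are
+all assumptions made for the example.
+

```python
import ipaddress
from dataclasses import dataclass, field
from typing import List, Optional

ROOT = ipaddress.ip_network("10.0.0.0/8")  # range the operator created via ACL


@dataclass
class RangeRequest:
    requester: str
    prefix_len: int
    status: str = "pending"  # hypothetical states: pending -> approved | denied
    block: Optional[ipaddress.IPv4Network] = None


@dataclass
class Ipam:
    reserved: List[ipaddress.IPv4Network] = field(default_factory=list)

    def create_request(self, requester: str, prefix_len: int) -> RangeRequest:
        # Recorded only; nothing is reserved until the request is approved.
        return RangeRequest(requester, prefix_len)

    def approve(self, req: RangeRequest) -> RangeRequest:
        # Reserve the first block of the requested size that overlaps no
        # existing reservation; deny if the root range is exhausted.
        for candidate in ROOT.subnets(new_prefix=req.prefix_len):
            if not any(candidate.overlaps(b) for b in self.reserved):
                self.reserved.append(candidate)
                req.block, req.status = candidate, "approved"
                return req
        req.status = "denied"
        return req


ipam = Ipam()
request = ipam.approve(ipam.create_request("metal-provisioner", 16))
print(request.status, request.block)
```

+
+Once a request is approved, the capability would be issued for
+=request.block=; how it is wrapped and delivered is outside this sketch.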
+
+The IAM operator portion could be removed
+
+
+
+----
+
+* IPAM worked example
+
+Let's assume we have an IPAM system which governs 10.0.0.0/8 and
+other IP blocks. We have a service, such as LBaaS, which needs to
+assign private IPs to customer load balancer instances. The LBaaS
+service needs to assign unique IPs to the load balancer instances so
+that customers can route traffic to their metal instances.
+
+The LB service needs to reach out to the IPAM service to pull an IP,
+and to do that, it must request it within a block represented by a
+wrapped resource. So how does the service initially obtain this
+wrapped resource?
+
+On first startup, the LBaaS service knows it doesn't have the
+capability to assign IPs because it doesn't have a wrapped resource
+for the range. It reaches out to the IPAM service, authenticated as
+itself, and requests a =/16=. That request is authorized just by the
+fact that the LB service has the correct audience to talk to the IPAM
+service.
+
+The request is recorded, and approval is either given by the IPAM
+operators or determined by business logic. Once approved, the
+wrapped resource for the requested range is issued to the LBaaS
+service, which it stores. Now, whenever an IP is needed, it makes an
+assignment under that wrapped resource.
+
+Internally, the IPAM service needs to record that a block is currently
+active, and that the capability sent to the LB service references
+it. As an example, let's say 10.0.0.0/8 is represented by the root
+resource identifier =ntwkblk-a1b2c3=. When the LB service requests a
+=/16=, a new IP reservation resource =ntwkipr-xyzxyz= is created.
+Once approved, a capability is created by calling
+=WrapResource(ntwkipr-xyzxyz, [create_assignment, read_assignment, delete_assignment], {})=,
+which produces a wrapped resource with ID =ntwkipr-u8e82i.qeoalf=.
+The IPAM service distributes this back to the LB service.
+
+When the LB service wishes to record an assignment to that block, it
+can make a request to the IPAM service's assignment endpoint
+(e.g. =POST /ip-reservations/ntwkipr-u8e82i.qeoalf/assignments=). From
+there, the IPAM service calls
+=UnwrapResource(ntwkipr-u8e82i.qeoalf, [create_assignment], {})=,
+which succeeds because the wrapped resource is valid, the verifier
+matches, and the operation is allowed for that ID. The assignment is
+then created.
+
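+One way to picture the wrap and unwrap calls is the sketch below. It
+is only an illustration under assumptions: the source does not say how
+the verifier is derived, so an HMAC over the wrapped ID is used here,
+and =SECRET=, =_wrapped=, =wrap_resource= and =unwrap_resource= are
+hypothetical names. The third argument is passed as ={}= in the
+example and is stored or ignored without interpretation.
+

```python
import hashlib
import hmac
import secrets

SECRET = secrets.token_bytes(32)  # hypothetical signing key held by the IPAM service
_wrapped = {}  # wrapped ID -> (underlying resource ID, allowed operations, extra)


def _verifier(wrapped_id: str) -> str:
    # Assumption: the verifier is a truncated HMAC of the wrapped ID.
    return hmac.new(SECRET, wrapped_id.encode(), hashlib.sha256).hexdigest()[:12]


def wrap_resource(resource_id: str, operations: list, extra: dict) -> str:
    kind = resource_id.split("-")[0]
    wrapped_id = f"{kind}-{secrets.token_hex(3)}"
    _wrapped[wrapped_id] = (resource_id, frozenset(operations), dict(extra))
    # Returned in "<wrapped id>.<verifier>" form, like ntwkipr-u8e82i.qeoalf.
    return f"{wrapped_id}.{_verifier(wrapped_id)}"


def unwrap_resource(token: str, operations: list, extra: dict) -> str:
    wrapped_id, _, verifier = token.partition(".")
    if wrapped_id not in _wrapped:
        raise PermissionError("unknown wrapped resource")
    if not hmac.compare_digest(verifier.encode(), _verifier(wrapped_id).encode()):
        raise PermissionError("verifier mismatch")
    resource_id, allowed, _stored_extra = _wrapped[wrapped_id]
    if not set(operations) <= allowed:
        raise PermissionError("operation not allowed for this wrapped resource")
    return resource_id  # the caller may now act on the underlying reservation


token = wrap_resource(
    "ntwkipr-xyzxyz",
    ["create_assignment", "read_assignment", "delete_assignment"],
    {},
)
print(unwrap_resource(token, ["create_assignment"], {}))
```

+
+The LB service only ever sees the wrapped ID; the mapping back to
+=ntwkipr-xyzxyz= stays inside the IPAM service.
+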
+This example describes a manual approval process and doesn't
+necessarily describe how the async process for yielding the
+capability back to the requesting service is implemented. The manual
+approval process could easily be replaced by setting limits per
+identity and requiring manual approval for higher limits, e.g. any
+product can request up to a /24, but anything larger needs manual
+approval by the governing team. In that case, the system becomes more
+dynamic and teams can self-serve their requests. The distribution of
+the capability must also happen over a secure channel, such as a NATS
+topic that only the requesting service has access to, or a direct
+callback API.
+
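+The per-identity limit idea could look something like this. The
+default of /24 comes from the example above; the override table, the
+identity names and the returned strings are assumptions for
+illustration only.
+

```python
DEFAULT_AUTO_PREFIX = 24  # any product may self-serve blocks up to a /24

# Hypothetical per-identity overrides: this identity may self-serve up to a /20.
AUTO_PREFIX_BY_IDENTITY = {"lbaas": 20}


def decide(identity: str, prefix_len: int) -> str:
    # A smaller prefix length means a larger block, so a /16 exceeds a /24 limit.
    limit = AUTO_PREFIX_BY_IDENTITY.get(identity, DEFAULT_AUTO_PREFIX)
    if prefix_len >= limit:
        return "approved"
    return "needs_manual_approval"


print(decide("metal-provisioner", 24))
print(decide("metal-provisioner", 16))
```

+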
+Further delegation is possible as well: the LB service could ask
+the IPAM service to wrap =ntwkipr-u8e82i.qeoalf= another time, but
+this time only to perform =read_assignment=. The LB team can then
+create operator tools to find details about the assignment from the
+IPAM service without having the ability to do damage.
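+
+A rough sketch of that narrowing step, assuming a re-wrap may only
+grant a subset of the operations the parent wrapped resource already
+allows. The names =wrap=, =rewrap= and =allows= are hypothetical, and
+verifier checking and revocation are left out to keep the example
+short.
+

```python
import secrets

_grants = {}  # wrapped ID -> (underlying resource ID, allowed operations)


def wrap(resource_id: str, operations: list) -> str:
    wrapped_id = f"ntwkipr-{secrets.token_hex(3)}.{secrets.token_hex(3)}"
    _grants[wrapped_id] = (resource_id, frozenset(operations))
    return wrapped_id


def rewrap(wrapped_id: str, operations: list) -> str:
    # A delegated capability can never carry more than its parent.
    resource_id, allowed = _grants[wrapped_id]
    if not set(operations) <= allowed:
        raise PermissionError("cannot grant operations the parent lacks")
    return wrap(resource_id, operations)


def allows(wrapped_id: str, operation: str) -> bool:
    return operation in _grants[wrapped_id][1]


full = wrap(
    "ntwkipr-xyzxyz",
    ["create_assignment", "read_assignment", "delete_assignment"],
)
read_only = rewrap(full, ["read_assignment"])
print(allows(read_only, "read_assignment"), allows(read_only, "delete_assignment"))
```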