#+TITLE: Integrating Equinix Metal API with Equinix Watch
#+AUTHOR: Adam Mohammed
* Problem
Equinix Watch has defined the format in which they want to ingest
auditable events. They chose OTLP as the protocol for ingesting these
events from services, restricting ingestion to just the logging
signal.
Normally, when sending data to a collector, you would use the
OpenTelemetry libraries to grab metadata about the request and the
surrounding environment without having to cobble that data together
by hand. Unfortunately, making OTEL logging the only signal Equinix
Watch accepts makes adoption needlessly painful: neither Ruby nor
Golang has a stable client library for OTEL logs.
Most of the spec provided by Equinix Watch does not actually relate
to the log we would like to provide to the customer. OTEL Logging
aims to make this simple by using the Baggage and Context APIs to
enrich log records with information about the surrounding environment
and context. Again, the implementations of these are incomplete and
not production ready.
Until the OTEL libraries support context and baggage propagation in
the Logs API/SDK, this data will need to be extracted and formatted
specifically for Equinix Watch, meaning the burden of integration is
higher than it needs to be. If we end up doing this, we'll probably
just fetch the same data from the span attributes anyway, to keep
things consistent.
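To make that burden concrete, here is a minimal sketch of the
hand-rolled path in Go (one of our service languages): pull the span
context and baggage out of the request context ourselves and reshape
them. The ~WatchEvent~ struct and the ~enduser.id~ baggage key are
placeholders for illustration; the actual Equinix Watch schema isn't
reproduced here.
#+BEGIN_SRC go
package audit

import (
	"context"

	"go.opentelemetry.io/otel/baggage"
	"go.opentelemetry.io/otel/trace"
)

// WatchEvent is a stand-in for whatever record shape Equinix Watch
// actually expects; the real schema is not reproduced in this note.
type WatchEvent struct {
	TraceID string            `json:"trace_id"`
	SpanID  string            `json:"span_id"`
	Actor   string            `json:"actor"`
	Action  string            `json:"action"`
	Extra   map[string]string `json:"extra"`
}

// BuildWatchEvent does the hand-rolled work a Logs SDK with context
// and baggage support would otherwise do for us: pull the span
// context and baggage out of the request context and reshape them.
func BuildWatchEvent(ctx context.Context, action string) WatchEvent {
	sc := trace.SpanContextFromContext(ctx)
	bag := baggage.FromContext(ctx)

	extra := make(map[string]string, len(bag.Members()))
	for _, m := range bag.Members() {
		extra[m.Key()] = m.Value()
	}

	return WatchEvent{
		TraceID: sc.TraceID().String(),
		SpanID:  sc.SpanID().String(),
		// Assumes the caller stored the acting user in baggage upstream.
		Actor:  bag.Member("enduser.id").Value(),
		Action: action,
		Extra:  extra,
	}
}
#+END_SRC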
There's absolutely no reason to do this work when we can attach the
logs to the trace in a structured way and pass that through to their
custom collector. By doing this we don't need to wait for the OTEL
libraries to ship logging implementations that do what traces already
provide.
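For contrast, here is a minimal sketch of the trace-based approach,
again in Go: the audit data rides along as a structured span event on
a span the standard instrumentation libraries have already enriched.
The ~audit.*~ attribute names and the ~DeleteDevice~ handler are
placeholders, not an agreed-on convention.
#+BEGIN_SRC go
package audit

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

var tracer = otel.Tracer("equinix-metal/audit")

// RecordAudit attaches a structured audit event to the span already
// carried in the context. The surrounding span holds the request
// metadata set by the instrumentation libraries, so the collector
// pipeline has everything it needs to build the Watch record.
func RecordAudit(ctx context.Context, action, resource string) {
	span := trace.SpanFromContext(ctx)
	span.AddEvent("audit", trace.WithAttributes(
		attribute.String("audit.action", action),     // e.g. "device.delete"
		attribute.String("audit.resource", resource), // e.g. a device UUID
	))
}

// DeleteDevice is an illustrative handler showing the call site.
func DeleteDevice(ctx context.Context, id string) error {
	ctx, span := tracer.Start(ctx, "DeleteDevice")
	defer span.End()

	RecordAudit(ctx, "device.delete", id)
	// ... perform the actual deletion ...
	return nil
}
#+END_SRC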
The only reason I can see not to do this is that it forces Equinix
Watch to handle translating trace information into a format that can
be delivered to their end targets. I'd argue that's going to need to
happen anyway, so why not make use of all the wonderful tools we have
to enrich the data you have as input, so you can build complete and
interesting audit logs for your end user.
* Concerns
- Alex: Yeahhhh I've gotta say I'm uncomfortable taking our existing
OTEL collector, which is right now part of our internal tooling, and
making it part of the critical path for customer data with Equinix
Watch.
I don't understand this; of course your collector is going to be in
the critical path. I'm not saying to use your collector as the ONLY
collector - this is why we even have collectors: we are able to
configure where the data are exported (see the sketch at the end of
this section).
- Alex: IMO internal traces are implementation details that are
subject to change and there are too many things that could go
wrong. What happens if the format of those traces changes due to
some library upgrade, or if there's memory pressure and we start
sampling events or something?
Traces being implementation details - like audit logs? There's a
reason we use standard libraries to instrument our traces. These
libraries follow the OTEL Semantic Conventions, so we have stable and
consistent span attributes that track data across services.
Memory pressure isn't solved by OTLP at all; in fact, collectors will
refuse spans when they're experiencing memory pressure to prevent
getting OOMKilled. This is not an application concern, this is a
monitoring concern. You should know if your collector is under memory
pressure and dropping data.
- Alex: In my experience, devs in general have a higher tolerance for
gaps and breakage in their internal tooling than what I'm willing to
have for customer-facing audit logs.
This is just poor form. If you don't trust the applications that
integrate with your application, what do you trust?
- Alex: I think customer-facing observability is net-new functionality
and, for the time being, I'm OK with putting a higher burden on
applications producing that data than "flip a flag in the collector
to redirect part of the firehose to Equinix Watch".
Net-new - sure, I agree.
Higher burden on applications producing the data - why though? We can
already provide you a higher-quality data source instead of
hand-rolling an implementation of the logs signal.
"Flip a flag in the collector" - I think this just shows a
misunderstanding of what the collector can do, but we are able to
control what parts are shipped to your fragile collector, as sketched
below.
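In practice that control would most likely live in the collector's
own routing and filtering configuration rather than in application
code, but the same idea can be sketched at the SDK level in Go: only
spans carrying an ~audit~ event (from the earlier sketch) reach the
Watch-facing exporter. Both endpoints below are placeholders.
#+BEGIN_SRC go
package audit

import (
	"context"

	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// auditOnlyProcessor forwards a span to the wrapped processor only if
// the span carries an "audit" event, so the Watch-facing exporter
// never sees the rest of the firehose.
type auditOnlyProcessor struct {
	sdktrace.SpanProcessor
}

func (p auditOnlyProcessor) OnEnd(s sdktrace.ReadOnlySpan) {
	for _, ev := range s.Events() {
		if ev.Name == "audit" {
			p.SpanProcessor.OnEnd(s)
			return
		}
	}
}

// NewTracerProvider wires two pipelines: every span goes to the
// internal collector, and audit spans additionally go to the Watch
// collector.
func NewTracerProvider(ctx context.Context) (*sdktrace.TracerProvider, error) {
	internal, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("otel-collector.internal:4317"))
	if err != nil {
		return nil, err
	}
	watch, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("watch-collector.example:4317"))
	if err != nil {
		return nil, err
	}

	return sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(internal),
		sdktrace.WithSpanProcessor(auditOnlyProcessor{sdktrace.NewBatchSpanProcessor(watch)}),
	), nil
}
#+END_SRC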