diff --git a/equinix/correction-of-errors/metal-2590.org b/equinix/correction-of-errors/metal-2590.org
new file mode 100644
index 0000000..a272f77
--- /dev/null
+++ b/equinix/correction-of-errors/metal-2590.org
@@ -0,0 +1,8 @@
+4/30 18:29 - Incident 2590 created
+4/30 18:34 - List posted of affected servers, zero core count causing billing issues for license activations
+4/30 18:35 - Nautilus goalie asks what needs to be changed
+4/30 18:37 - Ask is to have Nautilus engineer make prod data changes to allow a billing run to succeed
+4/30 18:37 - Nautilus goalie tries to figure out if there's time to test before making the change
+4/30 18:53 - Urgency is due to billing run set to start at 5/1 1:30 UTC
+4/30 19:33 - Determined scope of issue to be Instances with OSes that require core counts for licenses
+
diff --git a/equinix/correction-of-errors/metal-3020.org b/equinix/correction-of-errors/metal-3020.org
new file mode 100644
index 0000000..a7f6785
--- /dev/null
+++ b/equinix/correction-of-errors/metal-3020.org
@@ -0,0 +1,35 @@
+#+TITLE:
+
+
+2024-08-11 10:58 UTC - API 500s increased to 1500/min
+2024-08-11 11:14 UTC - Nautilus Goalie paged for 500 errors
+2024-08-11 11:23 UTC - Opened Incident 2030
+2024-08-11 11:25 UTC - Rollbar errors indicate issues with memcached
+2024-08-11 11:25 UTC - Honeycomb shows that all traffic is being served 500s
+2024-08-11 11:26 UTC - Increased memcached memory limit in an attempt to resolve Out of Memory errors
+2024-08-11 11:35 UTC - Called for status page
+2024-08-11 11:53 UTC - Started to see successful responses for production traffic
+2024-08-11 11:58 UTC - Re-occurrence of 500s
+2024-08-11 12:00 UTC - Update from AppSec that Kona alerts for an attack on the API
+2024-08-11 12:01 UTC - Cloudflare graphs posted that showed sharp drop in traffic at around 7:45 (not sure about granularity)
+2024-08-11 12:09 UTC - Observed log line in Splunk indicating timeouts when talking to memcached
+2024-08-11 12:12 UTC - Noticed K8s probes failing and causing application restarts
+2024-08-11 12:23 UTC - Posted graph of application cycling between healthy and unhealthy every 5 minutes
+2024-08-11 12:25 UTC - Determined the liveness probes were failing and causing the restarts after 5 minutes
+2024-08-11 12:26 UTC - Increased the liveness probe timeout from 3s to 10s
+2024-08-11 12:33 UTC - API served traffic for CF to bring origins back online
+2024-08-11 12:36 UTC - Metal API is up and serving requests but most requests are timing out, P95 is 100x what it is normally
+2024-08-11 13:33 UTC - Frontend pods being removed from serving traffic by failing readiness probes
+2024-08-11 13:33 UTC - Suspected issue with priming the cache, increased frontend pods to help alleviate request pressure
+2024-08-11 13:33 UTC - Looking to determine root cause of network timeouts
+2024-08-11 13:44 UTC - Posted memcached stats showing extremely high hit rate despite being nearly empty
+2024-08-11 14:09 UTC - Determined logging on memcached caused CPU throttling of the pod
+2024-08-11 14:18 UTC - Reduced log level on memcached pods and saw CFS throttling resolve
+2024-08-11 14:31 UTC - API back and serving requests for a short period
+2024-08-11 15:01 UTC - Updated memcached item_size_max to address Value too large errors from Flipper
+2024-08-11 15:50 UTC - Established confidence in root cause
+2024-08-11 16:04 UTC - API PR to disable caching feature flags in memcached
+2024-08-11 17:20 UTC - Deploying API PR to remove caching feature flags
+2024-08-11 18:08 UTC - Moved incident status to Monitoring
+2024-08-11 18:12 UTC - Metal API up and responding with slightly higher P95
+2024-08-11 22:29 UTC - Changed incident status to resolved
diff --git a/standup/identity.org b/standup/identity.org
new file mode 100644
index 0000000..e69de29