Updates

2024-11-12 10:13:44 -05:00
parent 63fe9cf740
commit 181b9a7bc3
3 changed files with 43 additions and 0 deletions
--- a/equinix/correction-of-errors/metal-2590.org
+++ b/equinix/correction-of-errors/metal-2590.org
@@ -0,0 +1,8 @@
+4/30 18:29 - Incident 2590 created
+4/30 18:34 - List of affected servers posted; zero core counts are causing billing issues for license activations
+4/30 18:35 - Nautilus goalie asks what needs to be changed
+4/30 18:37 - Ask is to have a Nautilus engineer make prod data changes to allow a billing run to succeed
+4/30 18:37 - Nautilus goalie tries to figure out if there's time to test before making the change
+4/30 18:53 - Urgency is due to billing run set to start at 5/1 1:30 UTC
+4/30 19:33 - Determined scope of issue to be Instances with OSes that require core counts for licenses
+
--- a/equinix/correction-of-errors/metal-3020.org
+++ b/equinix/correction-of-errors/metal-3020.org
@@ -0,0 +1,35 @@
+#+TITLE:
+
+
+2024-08-11 10:58 UTC - API 500s increased to 1500/min
+2024-08-11 11:14 UTC - Nautilus Goalie paged for 500 errors
+2024-08-11 11:23 UTC - Opened Incident 2030
+2024-08-11 11:25 UTC - Rollbar errors indicate issues with memcached
+2024-08-11 11:25 UTC - Honeycomb shows that all traffic is being served 500s
+2024-08-11 11:26 UTC - Increased memcached memory limit in an attempt to resolve Out of Memory errors
+2024-08-11 11:35 UTC - Called for status page
+2024-08-11 11:53 UTC - Started to see successful responses for production traffic
+2024-08-11 11:58 UTC - Re-occurrence of 500s
+2024-08-11 12:00 UTC - Update from AppSec on Kona alerts for an attack on the API
+2024-08-11 12:01 UTC - Cloudflare graphs posted showing a sharp drop in traffic at around 7:45 (granularity uncertain)
+2024-08-11 12:09 UTC - Observed log line in Splunk indicating timeouts when talking to memcached
+2024-08-11 12:12 UTC - Noticed K8s probes failing and causing application restarts
+2024-08-11 12:23 UTC - Posted graph of application cycling between healthy and unhealthy every 5 minutes
+2024-08-11 12:25 UTC - Determined the liveness probes were failing and causing the restarts after 5 minutes
+2024-08-11 12:26 UTC - Increased the probe timeout from 3s to 10s to accommodate slow responses
+2024-08-11 12:33 UTC - API served traffic for CF to bring origins back online
+2024-08-11 12:36 UTC - Metal API is up and serving requests, but most requests are timing out; P95 is 100x its normal value
+2024-08-11 13:33 UTC - Frontend pods being removed from serving traffic by failing readiness probes
+2024-08-11 13:33 UTC - Suspected issue with priming the cache, increased frontend pods to help alleviate request pressure
+2024-08-11 13:33 UTC - Looking to determine root cause of network timeouts
+2024-08-11 13:44 UTC - Posted memcached stats showing extremely high hit rate despite being nearly empty
+2024-08-11 14:09 UTC - Determined logging on memcached caused CPU throttling of the pod
+2024-08-11 14:18 UTC - Reduced log level on memcached pods and saw CFS throttling resolve
+2024-08-11 14:31 UTC - API back and serving requests for a short period
+2024-08-11 15:01 UTC - Updated memcached item_size_max to address Value too large errors from Flipper
+2024-08-11 15:50 UTC - Established confidence in root cause
+2024-08-11 16:04 UTC - API PR to disable caching of feature flags in memcached
+2024-08-11 17:20 UTC - Deploying API PR to remove caching of feature flags
+2024-08-11 18:08 UTC - Moved incident status to Monitoring
+2024-08-11 18:12 UTC - Metal API up and responding with slightly higher P95
+2024-08-11 22:29 UTC - Changed incident status to resolved