Adam Mohammed
2024-11-12 10:13:44 -05:00
parent 63fe9cf740
commit 181b9a7bc3
3 changed files with 43 additions and 0 deletions

@@ -0,0 +1,8 @@
4/30 18:29 - Incident 2590 created
4/30 18:34 - Posted list of affected servers; zero core counts are causing billing issues for license activations
4/30 18:35 - Nautilus goalie asks what needs to be changed
4/30 18:37 - The ask is to have a Nautilus engineer make prod data changes so the billing run can succeed
4/30 18:37 - Nautilus goalie tries to figure out if there's time to test before making the change
4/30 18:53 - Urgency is due to billing run set to start at 5/1 1:30 UTC
4/30 19:33 - Determined scope of issue to be instances with OSes that require core counts for licenses

@@ -0,0 +1,35 @@
#+TITLE:
2024-08-11 10:58 UTC - API 500s increased to 1500/min
2024-08-11 11:14 UTC - Nautilus Goalie paged for 500 errors
2024-08-11 11:23 UTC - Opened Incident 2030
2024-08-11 11:25 UTC - Rollbar errors indicate issues with memcached
2024-08-11 11:25 UTC - Honeycomb shows that all traffic is being served 500s
2024-08-11 11:26 UTC - Increased memcached memory limit in an attempt to resolve Out of Memory errors
2024-08-11 11:35 UTC - Called for status page
2024-08-11 11:53 UTC - Started to see successful responses for production traffic
2024-08-11 11:58 UTC - Re-occurrence of 500s
2024-08-11 12:00 UTC - Update from AppSec that Kona alerts for an attack on the API
2024-08-11 12:01 UTC - Posted Cloudflare graphs showing a sharp drop in traffic at around 7:45 (not sure about granularity)
2024-08-11 12:09 UTC - Observed log line in Splunk indicating timeouts when talking to memcached
2024-08-11 12:12 UTC - Noticed K8s probes failing and causing application restarts
2024-08-11 12:23 UTC - Posted graph of application cycling between healthy and unhealthy every 5 minutes
2024-08-11 12:25 UTC - Determined the liveness probes were failing and causing the restarts after 5 minutes
2024-08-11 12:26 UTC - Increased probe timeout from 3s to 10s to accommodate slow responses
2024-08-11 12:33 UTC - API served traffic for CF to bring origins back online
2024-08-11 12:36 UTC - Metal API is up and serving requests but most requests are timing out, P95 is 100x what it is normally
2024-08-11 13:33 UTC - Front end pods being removed from serving traffic by failing readiness probes
2024-08-11 13:33 UTC - Suspected issue with priming the cache, increased front end pods to help alleviate request pressure
2024-08-11 13:33 UTC - Looking to determine root cause of network timeouts
2024-08-11 13:44 UTC - Posted memcached stats showing extremely high hit rate despite being nearly empty
2024-08-11 14:09 UTC - Determined logging on memcached caused CPU throttling of the pod
2024-08-11 14:18 UTC - Reduced log level on memcached pods and saw CFS throttling resolve
2024-08-11 14:31 UTC - API back and serving requests for a short period
2024-08-11 15:01 UTC - Updated memcached item_size_max to address "Value too large" errors from Flipper
2024-08-11 15:50 UTC - Established confidence in root cause
2024-08-11 16:04 UTC - Opened API PR to disable caching of feature flags in memcached
2024-08-11 17:20 UTC - Deploying API PR to remove caching of feature flags
2024-08-11 18:08 UTC - Moved incident status to Monitoring
2024-08-11 18:12 UTC - Metal API up and responding with slightly higher P95
2024-08-11 22:29 UTC - Changed incident status to resolved