Updates
8
equinix/correction-of-errors/metal-2590.org
Normal file
@@ -0,0 +1,8 @@
4/30 18:29 - Incident 2590 created
4/30 18:34 - List of affected servers posted; zero core counts are causing billing issues for license activations
4/30 18:35 - Nautilus goalie asks what needs to be changed
4/30 18:37 - Ask is to have a Nautilus engineer make prod data changes to allow a billing run to succeed
4/30 18:37 - Nautilus goalie tries to figure out if there's time to test before making the change
4/30 18:53 - Urgency is due to the billing run set to start at 5/1 1:30 UTC
4/30 19:33 - Determined scope of issue to be instances with OSes that require core counts for licenses
35
equinix/correction-of-errors/metal-3020.org
Normal file
@@ -0,0 +1,35 @@
#+TITLE:

2024-08-11 10:58 UTC - API 500s increased to 1500/min
2024-08-11 11:14 UTC - Nautilus Goalie paged for 500 errors
2024-08-11 11:23 UTC - Opened Incident 2030
2024-08-11 11:25 UTC - Rollbar errors indicate issues with memcached
2024-08-11 11:25 UTC - Honeycomb shows that all traffic is being served 500s
2024-08-11 11:26 UTC - Increased memcached memory limit in an attempt to resolve Out of Memory errors
2024-08-11 11:35 UTC - Called for status page
2024-08-11 11:53 UTC - Started to see successful responses for production traffic
2024-08-11 11:58 UTC - Re-occurrence of 500s
2024-08-11 12:00 UTC - Update from AppSec that Kona is alerting for an attack on the API
2024-08-11 12:01 UTC - Cloudflare graphs posted that showed a sharp drop in traffic at around 7:45 (not sure about granularity)
2024-08-11 12:09 UTC - Observed log line in Splunk indicating timeouts when talking to memcached
2024-08-11 12:12 UTC - Noticed K8s probes failing and causing application restarts
2024-08-11 12:23 UTC - Posted graph of application cycling between healthy and unhealthy every 5 minutes
2024-08-11 12:25 UTC - Determined the liveness probes were failing and causing the restarts after 5 minutes
2024-08-11 12:26 UTC - Increased the liveness probe timeout from 3s to 10s
2024-08-11 12:33 UTC - API served traffic for CF to bring origins back online
2024-08-11 12:36 UTC - Metal API is up and serving requests, but most requests are timing out; P95 is 100x its normal value
2024-08-11 13:33 UTC - Frontend pods being removed from serving traffic by failing readiness probes
2024-08-11 13:33 UTC - Suspected issue with priming the cache; increased frontend pods to help alleviate request pressure
2024-08-11 13:33 UTC - Looking to determine root cause of network timeouts
2024-08-11 13:44 UTC - Posted memcached stats showing extremely high hit rate despite the cache being nearly empty
2024-08-11 14:09 UTC - Determined logging on memcached caused CPU throttling of the pod
2024-08-11 14:18 UTC - Reduced log level on memcached pods and saw CFS throttling resolve
2024-08-11 14:31 UTC - API back and serving requests for a short period
2024-08-11 15:01 UTC - Updated memcached item_size_max to address "Value too large" errors from Flipper
2024-08-11 15:50 UTC - Established confidence in root cause
2024-08-11 16:04 UTC - Opened API PR to disable caching of feature flags in memcached
2024-08-11 17:20 UTC - Deploying API PR to remove caching of feature flags
2024-08-11 18:08 UTC - Moved incident status to Monitoring
2024-08-11 18:12 UTC - Metal API up and responding with slightly higher P95
2024-08-11 22:29 UTC - Changed incident status to Resolved
0
standup/identity.org
Normal file