#+TITLE:

- 2024-08-11 10:58 UTC - API 500s increased to 1500/min
- 2024-08-11 11:14 UTC - Nautilus Goalie paged for 500 errors
- 2024-08-11 11:23 UTC - Opened Incident 2030
- 2024-08-11 11:25 UTC - Rollbar errors indicate issues with memcached
- 2024-08-11 11:25 UTC - Honeycomb shows that all traffic is being served 500s
- 2024-08-11 11:26 UTC - Increased memcached memory limit in an attempt to resolve Out of Memory errors
- 2024-08-11 11:35 UTC - Called for status page
- 2024-08-11 11:53 UTC - Started to see successful responses for production traffic
- 2024-08-11 11:58 UTC - Re-occurrence of 500s
- 2024-08-11 12:00 UTC - Update from AppSec: Kona alerts for an attack on the API
- 2024-08-11 12:01 UTC - Posted Cloudflare graphs showing a sharp drop in traffic at around 7:45 (not sure about granularity)
- 2024-08-11 12:09 UTC - Observed log line in Splunk indicating timeouts when talking to memcached
- 2024-08-11 12:12 UTC - Noticed K8s probes failing and causing application restarts
- 2024-08-11 12:23 UTC - Posted graph of application cycling between healthy and unhealthy every 5 minutes
- 2024-08-11 12:25 UTC - Determined the liveness probes were failing and causing the restarts after 5 minutes
- 2024-08-11 12:26 UTC - Increased the probe timeout from 3s to 10s to accommodate the slow responses
- 2024-08-11 12:33 UTC - API served traffic for CF to bring origins back online
- 2024-08-11 12:36 UTC - Metal API is up and serving requests, but most requests are timing out; P95 is 100x its normal value
- 2024-08-11 13:33 UTC - Frontend pods being removed from serving traffic by failing readiness probes
- 2024-08-11 13:33 UTC - Suspected issue with priming the cache; increased the number of frontend pods to help alleviate request pressure
- 2024-08-11 13:33 UTC - Looking to determine root cause of network timeouts
- 2024-08-11 13:44 UTC - Posted memcached stats showing an extremely high hit rate despite the cache being nearly empty
- 2024-08-11 14:09 UTC - Determined that logging on memcached caused CPU throttling of the pod
- 2024-08-11 14:18 UTC - Reduced log level on memcached pods and saw CFS throttling resolve
- 2024-08-11 14:31 UTC - API back and serving requests for a short period
- 2024-08-11 15:01 UTC - Updated memcached item_size_max to address "Value too large" errors from Flipper
- 2024-08-11 15:50 UTC - Established confidence in root cause
- 2024-08-11 16:04 UTC - API PR to disable caching feature flags in memcached
- 2024-08-11 17:20 UTC - Deploying API PR to remove caching feature flags
- 2024-08-11 18:08 UTC - Moved incident status to Monitoring
- 2024-08-11 18:12 UTC - Metal API up and responding with slightly higher P95
- 2024-08-11 22:29 UTC - Changed incident status to resolved
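
The 13:44 observation (a very high hit rate on a nearly empty cache) can be reproduced from the text reply of memcached's =stats= command. The following is a minimal sketch, not the tooling used during the incident: the helper names (=parse_stats=, =hit_rate=, =fill_ratio=) are hypothetical, and the =SAMPLE= values are invented for illustration, not taken from the incident. It relies only on the standard =get_hits=, =get_misses=, =bytes=, and =limit_maxbytes= counters.

```python
def parse_stats(raw):
    """Parse the text reply of memcached's `stats` command into a dict."""
    stats = {}
    for line in raw.splitlines():
        parts = line.split(None, 2)
        # Each data line looks like "STAT <name> <value>"; the reply ends with "END".
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    return stats


def hit_rate(stats):
    """Fraction of get requests answered from the cache."""
    hits = int(stats["get_hits"])
    misses = int(stats["get_misses"])
    total = hits + misses
    return hits / total if total else 0.0


def fill_ratio(stats):
    """Fraction of the configured memory limit currently holding items."""
    return int(stats["bytes"]) / int(stats["limit_maxbytes"])


# Illustrative values only (assumption): 1 MiB stored against a 1 GiB limit.
SAMPLE = """STAT get_hits 990
STAT get_misses 10
STAT curr_items 12
STAT bytes 1048576
STAT limit_maxbytes 1073741824
END"""

if __name__ == "__main__":
    s = parse_stats(SAMPLE)
    print(f"hit rate:   {hit_rate(s):.2%}")
    print(f"fill ratio: {fill_ratio(s):.4%}")
```

A high hit rate together with a very low fill ratio suggests that a small number of keys account for most reads, which would be consistent with the feature flag caching identified later in the timeline.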