:PROPERTIES:
:ID: 30330ca0-3a44-4b56-8bce-6f6cce9ab115
:END:
#+title: 2021-09-03

* CP crashes

Looking into [[id:57ee2f00-9bcd-4e0f-8a77-ae1f2d4cda89][Control Panel]] OOM issues in Kubernetes. Average of ~138 MB memory usage per request.

#+begin_example
# statm values are in pages; in `ps aufx` output the PID is field 2, not field 1
ps aufx | grep apache | grep -v grep | awk '{print "cat /proc/" $2 "/statm"}' | sh
#+end_example

The memory limit should also account for the 256 MB APC cache.

#+CAPTION: Josh Benner doing math on likely requirements
#+begin_quote
rps = 32
ram per worker = 150 MB
minimum workers = rps => 32
num workers = minimum workers * 2 => 64
desired pod count = 8
workers per pod = num workers / desired pod count => 8
apc ram = 256 MB
ram per pod = ram per worker * workers per pod + apc ram => 1,456 MB

rps = 32
ram per worker = 150 MB
minimum workers = rps => 32
num workers = minimum workers * 2 => 64
desired pod count = 8
workers per pod = num workers / desired pod count * 2 => 16
apc ram = 256 MB
ram per pod = ram per worker * workers per pod + apc ram => 2,656 MB

Making assumptions on baseline requests per second and duration of request:

active workers = 20 * 8 => 160
apc ram = 256 MB
ram per worker = 150 MB
pod count = 8
workers per pod = active workers / pod count => 20
ram per pod = workers per pod * ram per worker => 3,000 MB

active workers = 20 * 8 * 2 => 320
apc ram = 256 MB
ram per worker = 150 MB
pod count = 20
workers per pod = active workers / pod count => 16
ram per pod = workers per pod * ram per worker => 2,400 MB
#+end_quote

Actions taken:
- Made per-pod worker count configurable via Consul (easier to change if needed).
- Tuned pod worker counts and memory allocations based on memory-usage metrics (helps avoid OOMKills).
- Removed the k8s liveness probe (avoids k8s kills when loaded).
- Revised the k8s readiness probe to just check TCP socket availability (avoids removing a pod from rotation during load); see the manifest sketch at the end of this entry.
- Load-tested these changes in staging: confirmed that pods stay up under load, and that scaling out more readily mitigates pod overload.
- Added monitoring to draw attention to load-induced symptoms in control-panel pods.
- Will investigate auto-scaling next week as an additional load-mitigation tool.

(details: 32 workers per pod, 1Gi RAM per pod (for now), 8 pods)
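For completeness, a variant of the sampling one-liner above that converts each worker's resident set from pages to MB. This is a sketch with two assumptions: procps =ps= (PID in the second column) and the usual 4 KiB page size.

#+begin_example
# statm field 2 is the resident set in pages; 4096-byte pages assumed
# the [a]pache trick keeps the grep process itself out of the match
ps aufx | grep [a]pache | awk '{print $2}' \
  | xargs -I{} cat /proc/{}/statm 2>/dev/null \
  | awk '{ printf "%.1f MB\n", $2 * 4096 / 1048576 }'
#+end_example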
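As a quick cross-check of the first scenario in Josh's math above, the same arithmetic in shell. Every number here comes from the quote; nothing is newly measured.

#+begin_example
rps=32
ram_per_worker=150                               # MB, observed per worker
num_workers=$(( rps * 2 ))                       # 2x headroom => 64
pod_count=8
workers_per_pod=$(( num_workers / pod_count ))   # => 8
apc_ram=256                                      # MB per pod for the APC cache
echo "$(( ram_per_worker * workers_per_pod + apc_ram )) MB per pod"   # => 1456 MB
#+end_example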
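The probe and resource changes under "Actions taken" would look roughly like the fragment below. This is a sketch, not the actual manifest: the container name, port, and probe timing are assumptions.

#+begin_example
containers:
  - name: control-panel        # hypothetical name
    resources:
      limits:
        memory: 1Gi            # per-pod ceiling noted above (for now)
    # livenessProbe removed entirely: under load it was tripping and
    # k8s was killing otherwise-healthy pods
    readinessProbe:
      tcpSocket:               # passes as long as the socket accepts
        port: 80               # connections, so a busy pod stays in rotation
      periodSeconds: 10
#+end_example

A TCP check only proves the listener is alive; it deliberately trades accuracy for stability, since an HTTP readiness check was pulling loaded-but-healthy pods out of rotation.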