:PROPERTIES:
:ID: 30330ca0-3a44-4b56-8bce-6f6cce9ab115
:END:
#+title: 2021-09-03

* CP crashes

Looking into [[id:57ee2f00-9bcd-4e0f-8a77-ae1f2d4cda89][Control Panel]] OOM issues in Kubernetes. Average of ~138 MB memory usage per request.

#+begin_example
# statm values are in pages; in `ps aufx` output the PID is field 2, not field 1
ps aufx | grep apache | grep -v grep | awk '{print "cat /proc/" $2 "/statm"}' | sh
#+end_example

The memory limit should also account for the 256 MB APC cache.

#+CAPTION: Josh Benner doing math on likely requirements
#+begin_quote
rps = 32
ram per worker = 150 MB
minimum workers = rps => 32
num workers = minimum workers * 2 => 64
desired pod count = 8
workers per pod = num workers / desired pod count => 8
apc ram = 256 MB
ram per pod = ram per worker * workers per pod + apc ram => 1,456 MB

rps = 32
ram per worker = 150 MB
minimum workers = rps => 32
num workers = minimum workers * 2 => 64
desired pod count = 8
workers per pod = num workers / desired pod count * 2 => 16
apc ram = 256 MB
ram per pod = ram per worker * workers per pod + apc ram => 2,656 MB

Making assumptions on baseline requests per second and duration of request:

active workers = 20 * 8 => 160
apc ram = 256 MB
ram per worker = 150 MB
pod count = 8
workers per pod = active workers / pod count => 20
ram per pod = workers per pod * ram per worker => 3,000 MB

active workers = 20 * 8 * 2 => 320
apc ram = 256 MB
ram per worker = 150 MB
pod count = 20
workers per pod = active workers / pod count => 16
ram per pod = workers per pod * ram per worker => 2,400 MB
#+end_quote

Actions taken:
- Made per-pod worker count configurable via Consul (easier to change if needed).
- Tuned pod worker counts and memory allocations based on memory-usage metrics (helps avoid OOMKills).
- Removed the k8s liveness probe (avoids k8s kills when loaded).
- Revised the k8s readiness probe to just check TCP socket availability (avoids removing a pod from rotation during load); see the manifest sketch at the end of this entry.
- Load-tested these changes in staging: confirmed that pods stay up under load, and that scaling out more readily mitigates pod overload.
- Added monitoring to draw attention to load-induced symptoms in control-panel pods.
- Will investigate auto-scaling next week as an additional load-mitigation tool.

(details: 32 workers per pod, 1Gi RAM per pod (for now), 8 pods)
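For completeness, a variant of the sampling one-liner above that converts each worker's resident set from pages to MB. This is a sketch with two assumptions: procps =ps= (PID in the second column) and the usual 4 KiB page size.

#+begin_example
# statm field 2 is the resident set in pages; 4096-byte pages assumed
# the [a]pache trick keeps the grep process itself out of the match
ps aufx | grep [a]pache | awk '{print $2}' \
  | xargs -I{} cat /proc/{}/statm 2>/dev/null \
  | awk '{ printf "%.1f MB\n", $2 * 4096 / 1048576 }'
#+end_example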
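As a quick cross-check of the first scenario in Josh's math above, the same arithmetic in shell. Every number here comes from the quote; nothing is newly measured.

#+begin_example
rps=32
ram_per_worker=150                               # MB, observed per worker
num_workers=$(( rps * 2 ))                       # 2x headroom => 64
pod_count=8
workers_per_pod=$(( num_workers / pod_count ))   # => 8
apc_ram=256                                      # MB per pod for the APC cache
echo "$(( ram_per_worker * workers_per_pod + apc_ram )) MB per pod"   # => 1456 MB
#+end_example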
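The probe and resource changes under "Actions taken" would look roughly like the fragment below. This is a sketch, not the actual manifest: the container name, port, and probe timing are assumptions.

#+begin_example
containers:
  - name: control-panel        # hypothetical name
    resources:
      limits:
        memory: 1Gi            # per-pod ceiling noted above (for now)
    # livenessProbe removed entirely: under load it was tripping and
    # k8s was killing otherwise-healthy pods
    readinessProbe:
      tcpSocket:               # passes as long as the socket accepts
        port: 80               # connections, so a busy pod stays in rotation
      periodSeconds: 10
#+end_example

A TCP check only proves the listener is alive; it deliberately trades accuracy for stability, since an HTTP readiness check was pulling loaded-but-healthy pods out of rotation.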