
2021-09-03

CP crashes

Looking into Control Panel OOM issues in Kubernetes.

Average memory usage: ~138 MB per request.

  # Memory usage (in pages) per Apache worker; PID is column 2 of `ps aufx`.
  ps aufx | grep '[a]pache' | awk '{print "cat /proc/" $2 "/statm"}' | sh 2>/dev/null
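
To get the per-worker average, the resident column of statm (field 2, in pages) has to be converted to bytes. A rough sketch, assuming 4 KiB pages:

  # Average resident memory (MB) across Apache workers, from statm field 2 (RSS, in pages).
  # Assumes a 4 KiB page size; check with `getconf PAGESIZE` if unsure.
  for pid in $(ps aux | awk '/[a]pache/ {print $2}'); do
    awk '{print $2 * 4096 / 1048576}' "/proc/$pid/statm" 2>/dev/null
  done | awk '{sum += $1; n++} END {if (n) printf "avg %.1f MB over %d workers\n", sum/n, n}'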

Memory limit should account for the 256 MB APC cache.

  rps = 32
  ram per worker = 150 MB
  minimum workers = rps => 32
  num workers = minimum workers * 2 => 64
  desired pod count = 8
  workers per pod = num workers / desired pod count => 8
  apc ram = 256 MB
  ram per pod = ram per worker * workers per pod + apc ram => 1,456 MB

  rps = 32
  ram per worker = 150 MB
  minimum workers = rps => 32
  num workers = minimum workers * 2 => 64
  desired pod count = 8
  workers per pod = num workers / desired pod count * 2 => 16
  apc ram = 256 MB
  ram per pod = ram per worker * workers per pod + apc ram => 2,656 MB
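
The same rps-based arithmetic as a quick shell check; every input is an assumed value from the notes above, not a measurement:

  # rps-based sizing: 2x worker headroom, doubled workers per pod, plus the APC cache.
  rps=32; ram_per_worker=150; apc_ram=256; desired_pod_count=8
  num_workers=$(( rps * 2 ))                                   # => 64
  workers_per_pod=$(( num_workers / desired_pod_count * 2 ))   # => 16
  echo "ram per pod: $(( ram_per_worker * workers_per_pod + apc_ram )) MB"   # => 2656 MB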

These estimates make assumptions about baseline requests per second and request duration.

  active workers = 20 * 8 => 160
  apc ram = 256 MB
  ram per worker = 150 MB
  pod count = 8
  workers per pod = active workers / pod count => 20
  ram per pod = workers per pod * ram per worker => 3,000 MB

  active workers = 20 * 8 * 2 => 320
  apc ram = 256 MB
  ram per worker = 150 MB
  pod count = 20
  workers per pod = active workers / pod count => 16
  ram per pod = workers per pod * ram per worker => 2,400 MB
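
And the concurrency-based variant (again, all assumed inputs); this one sizes workers from active requests rather than rps and tracks the APC cache separately instead of folding it into ram per pod:

  # concurrency-based sizing: workers derived from concurrently active requests.
  active_workers=$(( 20 * 8 * 2 ))   # => 320
  pod_count=20; ram_per_worker=150
  workers_per_pod=$(( active_workers / pod_count ))   # => 16
  echo "ram per pod: $(( workers_per_pod * ram_per_worker )) MB"   # => 2400 MB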

Josh Benner is doing the math on likely requirements.

Actions taken:

  • Made per-pod worker count configurable via consul (easier to change if needed)
  • Tuned pod worker counts and memory allocations based on memory usage metrics (helps avoid OOMKills)
  • Removed the k8s liveness probe (avoids k8s-kills when loaded)
  • Revised the k8s readiness probe to only check TCP socket availability (avoids removing pods from rotation under load); see the sketch after this list
  • Load-tested these changes in staging: confirmed that pods stay up under load and that scaling out mitigates load more readily
  • Added monitoring to draw attention to load-induced symptoms in control-panel pods
  • Will investigate auto-scaling next week as an additional load-mitigation tool
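
One way to express the two probe changes, sketched with kubectl patch; the deployment/container names and the port are placeholders, not what's actually deployed:

  # Strategic-merge patch: drop the liveness probe (null deletes the field) and
  # replace the readiness probe with a plain TCP-socket check.
  kubectl patch deployment control-panel --type=strategic -p '{
    "spec": {"template": {"spec": {"containers": [{
      "name": "control-panel",
      "livenessProbe": null,
      "readinessProbe": {"tcpSocket": {"port": 80}, "periodSeconds": 10}
    }]}}}}'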

(details: 32 workers per pod, 1Gi RAM per pod (for now), 8 pods)
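
For reference, roughly equivalent kubectl commands for the memory limit and pod count (deployment name is a placeholder; the worker count itself is set via consul, not k8s):

  kubectl set resources deployment control-panel --requests=memory=1Gi --limits=memory=1Gi
  kubectl scale deployment control-panel --replicas=8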