
2021-09-03

CP crashes

Looking into Control Panel OOM issues in Kubernetes.

Average memory usage: ~138 MB per request.

  # Memory usage (in pages) per Apache worker; PID is column 2 of `ps aufx`.
  ps aufx | grep '[a]pache' | awk '{print "cat /proc/" $2 "/statm"}' | sh 2>/dev/null
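
To get the per-worker average, the resident column of statm (field 2, in pages) has to be converted to bytes. A rough sketch, assuming 4 KiB pages:

  # Average resident memory (MB) across Apache workers, from statm field 2 (RSS, in pages).
  # Assumes a 4 KiB page size; check with `getconf PAGESIZE` if unsure.
  for pid in $(ps aux | awk '/[a]pache/ {print $2}'); do
    awk '{print $2 * 4096 / 1048576}' "/proc/$pid/statm" 2>/dev/null
  done | awk '{sum += $1; n++} END {if (n) printf "avg %.1f MB over %d workers\n", sum/n, n}'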

Memory limit should account for the 256 MB APC cache.

  rps = 32
  ram per worker = 150 MB
  minimum workers = rps => 32
  num workers = minimum workers * 2 => 64
  desired pod count = 8
  workers per pod = num workers / desired pod count => 8
  apc ram = 256 MB
  ram per pod = ram per worker * workers per pod + apc ram => 1,456 MB

  rps = 32
  ram per worker = 150 MB
  minimum workers = rps => 32
  num workers = minimum workers * 2 => 64
  desired pod count = 8
  workers per pod = num workers / desired pod count * 2 => 16
  apc ram = 256 MB
  ram per pod = ram per worker * workers per pod + apc ram => 2,656 MB
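
The same rps-based arithmetic as a quick shell check; every input is an assumed value from the notes above, not a measurement:

  # rps-based sizing: 2x worker headroom, doubled workers per pod, plus the APC cache.
  rps=32; ram_per_worker=150; apc_ram=256; desired_pod_count=8
  num_workers=$(( rps * 2 ))                                   # => 64
  workers_per_pod=$(( num_workers / desired_pod_count * 2 ))   # => 16
  echo "ram per pod: $(( ram_per_worker * workers_per_pod + apc_ram )) MB"   # => 2656 MB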

These estimates make assumptions about baseline requests per second and request duration.

  active workers = 20 * 8 => 160
  apc ram = 256 MB
  ram per worker = 150 MB
  pod count = 8
  workers per pod = active workers / pod count => 20
  ram per pod = workers per pod * ram per worker => 3,000 MB

  active workers = 20 * 8 * 2 => 320
  apc ram = 256 MB
  ram per worker = 150 MB
  pod count = 20
  workers per pod = active workers / pod count => 16
  ram per pod = workers per pod * ram per worker => 2,400 MB
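
And the concurrency-based variant (again, all assumed inputs); this one sizes workers from active requests rather than rps and tracks the APC cache separately instead of folding it into ram per pod:

  # concurrency-based sizing: workers derived from concurrently active requests.
  active_workers=$(( 20 * 8 * 2 ))   # => 320
  pod_count=20; ram_per_worker=150
  workers_per_pod=$(( active_workers / pod_count ))   # => 16
  echo "ram per pod: $(( workers_per_pod * ram_per_worker )) MB"   # => 2400 MB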

Josh Benner is doing the math on likely requirements.

Actions taken:

  • Made per-pod worker count configurable via consul (easier to change if needed)
  • Tuned pod worker counts and memory allocations based on memory usage metrics (helps avoid OOMKills)
  • Removed the k8s liveness probe (avoids k8s-kills when loaded)
  • Revised the k8s readiness probe to only check TCP socket availability (avoids removing pods from rotation under load); see the sketch after this list
  • Load-tested these changes in staging: confirmed that pods stay up under load and that scaling out mitigates load more readily
  • Added monitoring to draw attention to load-induced symptoms in control-panel pods
  • Will investigate auto-scaling next week as an additional load-mitigation tool
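
One way to express the two probe changes, sketched with kubectl patch; the deployment/container names and the port are placeholders, not what's actually deployed:

  # Strategic-merge patch: drop the liveness probe (null deletes the field) and
  # replace the readiness probe with a plain TCP-socket check.
  kubectl patch deployment control-panel --type=strategic -p '{
    "spec": {"template": {"spec": {"containers": [{
      "name": "control-panel",
      "livenessProbe": null,
      "readinessProbe": {"tcpSocket": {"port": 80}, "periodSeconds": 10}
    }]}}}}'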

(details: 32 workers per pod, 1Gi RAM per pod (for now), 8 pods)
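
For reference, roughly equivalent kubectl commands for the memory limit and pod count (deployment name is a placeholder; the worker count itself is set via consul, not k8s):

  kubectl set resources deployment control-panel --requests=memory=1Gi --limits=memory=1Gi
  kubectl scale deployment control-panel --replicas=8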