:PROPERTIES:
:ID: 30330ca0-3a44-4b56-8bce-6f6cce9ab115
:END:
#+title: 2021-09-03
* CP crashes
Looking into [[id:57ee2f00-9bcd-4e0f-8a77-ae1f2d4cda89][Control Panel]] OOM issues in Kubernetes.
Average of ~138 MB memory usage per request:
#+begin_example
# Memory stats (in pages) for each Apache process; with `ps aufx` the PID is column 2.
ps aufx | grep apache | awk '{print "cat /proc/" $2 "/statm"}' | sh | grep -v open
#+end_example
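For reference, =/proc/PID/statm= reports sizes in pages, so arriving at an MB figure means converting pages to bytes. A minimal sketch of that conversion, assuming 4 KiB pages and that the workers match =apache= (not the command actually used above):

#+begin_example
# Rough sketch: print RSS in MiB per matching process.
# statm field 2 is resident pages; 4 KiB pages are assumed.
for pid in $(pgrep -f apache); do
  awk -v pid="$pid" '{printf "pid %s: %.1f MiB\n", pid, $2 * 4096 / 1048576}' "/proc/$pid/statm"
done
#+end_example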
Memory limit should account for the 256 MB APC cache.
#+CAPTION: Josh Benner doing math on likely requirements
#+begin_quote
rps = 32
ram per worker = 150 MB
minimum workers = rps => 32
num workers = minimum workers * 2 => 64
desired pod count = 8
workers per pod = num workers / desired pod count => 8
apc ram = 256 MB
ram per pod = ram per worker * workers per pod + apc ram => 1,456 MB

rps = 32
ram per worker = 150 MB
minimum workers = rps => 32
num workers = minimum workers * 2 => 64
desired pod count = 8
workers per pod = num workers / desired pod count * 2 => 16
apc ram = 256 MB
ram per pod = ram per worker * workers per pod + apc ram => 2,656 MB

Making assumptions on baseline requests per second and duration of request:

active workers = 20 * 8 => 160
apc ram = 256 MB
ram per worker = 150 MB
pod count = 8
workers per pod = active workers / pod count => 20
ram per pod = workers per pod * ram per worker => 3,000 MB

active workers = 20 * 8 * 2 => 320
apc ram = 256 MB
ram per worker = 150 MB
pod count = 20
workers per pod = active workers / pod count => 16
ram per pod = workers per pod * ram per worker => 2,400 MB
#+end_quote
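The sizing above boils down to =ram per pod = ram per worker * workers per pod + apc ram=. A quick sanity check, plugging in the second scenario's values:

#+begin_example
# Sanity check of the sizing formula using the second scenario's numbers (expect 2,656 MB).
awk 'BEGIN { ram_per_worker = 150; workers_per_pod = 16; apc_ram = 256;
             printf "ram per pod = %d MB\n", ram_per_worker * workers_per_pod + apc_ram }'
#+end_example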
Actions taken:
- Made the per-pod worker count configurable via consul (easier to change if needed; sketch after this list)
- Tuned pod worker counts and memory allocations based on memory usage metrics (helps avoid OOMKills)
- Removed the k8s liveness probe (avoids k8s killing pods under load)
- Revised the k8s readiness probe to only check TCP socket availability, so loaded pods are not pulled from rotation (sketch after this list)
- Load-tested these changes in staging: confirmed that pods stay up under load and that scaling out more readily mitigates pod overload
- Added monitoring to draw attention to load-induced symptoms on control-panel pods.
- Auto-scaling will be investigated next week as an additional load-mitigation tool.
Current settings: 32 workers per pod, 1 Gi RAM per pod (for now), 8 pods.
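For reference, rough sketches of two of the changes above; the consul key name, deployment/container name, and port are assumptions for illustration, not the actual values:

#+begin_example
# Hypothetical consul key for the per-pod worker count (actual key name not recorded here).
consul kv put control-panel/apache/workers_per_pod 32

# Hypothetical probe patches: drop the liveness probe and switch readiness to a plain TCP check.
kubectl patch deployment control-panel --type=json \
  -p '[{"op": "remove", "path": "/spec/template/spec/containers/0/livenessProbe"}]'
kubectl patch deployment control-panel \
  -p '{"spec":{"template":{"spec":{"containers":[{"name":"control-panel","readinessProbe":{"httpGet":null,"tcpSocket":{"port":80}}}]}}}}'
#+end_example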