:PROPERTIES:
:ID: 30330ca0-3a44-4b56-8bce-6f6cce9ab115
:END:
#+title: 2021-09-03

* CP crashes

Looking into [[id:57ee2f00-9bcd-4e0f-8a77-ae1f2d4cda89][Control Panel]] OOM issues in Kubernetes.

Average of ~138 MB memory usage per request.

#+begin_example
ps aufx | grep apache | awk '{print "cat /proc/" $2 "/statm"}' | sh | grep -v open
#+end_example
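A variant of the same pipeline can compute the average directly instead of dumping raw statm lines. This is a sketch, not the command actually run: it assumes Linux's 4 KiB page size, that field 2 of =/proc/<pid>/statm= is resident pages, and uses the =[a]pache= trick to keep grep's own process out of the match.

```shell
# Average resident memory (MB) across apache workers.
# Assumes 4 KiB pages; statm field 2 is resident pages.
ps aufx | grep '[a]pache' | awk '{print $2}' \
  | xargs -I{} cat /proc/{}/statm 2>/dev/null \
  | awk '{sum += $2} END {if (NR) printf "%.0f MB average\n", sum * 4096 / 1048576 / NR}'
```

With workers averaging 35,328 resident pages, this prints roughly the ~138 MB figure above.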

Memory limit should account for the 256 MB APC cache.

#+CAPTION: Josh Benner doing math on likely requirements
#+begin_quote
rps = 32
ram per worker = 150 MB
minimum workers = rps => 32
num workers = minimum workers * 2 => 64
desired pod count = 8
workers per pod = num workers / desired pod count => 8
apc ram = 256 MB
ram per pod = ram per worker * workers per pod + apc ram => 1,456 MB

rps = 32
ram per worker = 150 MB
minimum workers = rps => 32
num workers = minimum workers * 2 => 64
desired pod count = 8
workers per pod = num workers / desired pod count * 2 => 16
apc ram = 256 MB
ram per pod = ram per worker * workers per pod + apc ram => 2,656 MB

Making assumptions about baseline requests per second and request duration:

active workers = 20 * 8 => 160
apc ram = 256 MB
ram per worker = 150 MB
pod count = 8
workers per pod = active workers / pod count => 20
ram per pod = workers per pod * ram per worker => 3,000 MB

active workers = 20 * 8 * 2 => 320
apc ram = 256 MB
ram per worker = 150 MB
pod count = 20
workers per pod = active workers / pod count => 16
ram per pod = workers per pod * ram per worker => 2,400 MB
#+end_quote
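The first scenario above works out the same way in plain shell arithmetic (a sketch; variable names are mine, the numbers are from the quote):

```shell
# First sizing scenario from the quote: 2x worker headroom, 8 pods.
rps=32
ram_per_worker=150                              # MB, rounded up from ~138 MB
minimum_workers=$rps                            # one worker per request/second
num_workers=$(( minimum_workers * 2 ))          # 2x headroom => 64
pod_count=8
workers_per_pod=$(( num_workers / pod_count ))  # => 8
apc_ram=256                                     # MB, per-pod APC cache
ram_per_pod=$(( ram_per_worker * workers_per_pod + apc_ram ))
echo "$ram_per_pod MB per pod"                  # matches the 1,456 MB above
```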

Actions taken:
- Made per-pod worker count configurable via consul (easier to change if needed)
- Tuned pod worker counts and memory allocations based on memory-usage metrics (helps avoid OOMKills)
- Removed the k8s liveness probe (avoids k8s-kills when loaded)
- Revised the k8s readiness probe to just check TCP socket availability (avoids removing pods from rotation during load)
- Load-tested these changes in staging: confirmed that pods stay up under load, and that scaling more readily mitigates pod crashes
- Added monitoring to draw attention to load-induced symptoms in control-panel pods
- Will investigate auto-scaling next week as an additional load-mitigation tool

(details: 32 workers per pod, 1Gi RAM per pod (for now), 8 pods)
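The probe and resource changes would look roughly like this in the Deployment spec. A sketch only: the container name, port, and probe timing are assumptions; only the 8 replicas, the 1Gi memory limit, and the TCP readiness check come from this note.

```yaml
# Hypothetical excerpt of the control-panel Deployment after the changes.
# No livenessProbe: it was removed to avoid k8s-kills under load.
spec:
  replicas: 8                      # desired pod count
  template:
    spec:
      containers:
        - name: control-panel      # assumed container name
          resources:
            limits:
              memory: 1Gi          # per-pod RAM (for now)
          readinessProbe:
            tcpSocket:
              port: 80             # assumed port
            periodSeconds: 10      # assumed timing
```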