Excessive load, uWSGI workers, VM compute resources, 502/504 errors

tolexy · August 25, 2020, 3:41pm

I have had KoBo running for quite some time now on our own servers and it has been wonderful. This community and all of the supporting members have been great through the process. I have come across some interesting issues over the past year and am hoping someone can help with this one.

Periodically, our servers will “go down.” I actually have three instances running: prod and dev, both in an Azure environment, and another in AWS Lightsail. Each of the three will periodically throw a 502 or 504 error from nginx, and the whole site stays down until I restart all the containers using python3 ./kobo-install/run.py. The production instance goes down far more often, and it appears this MIGHT be load related. I haven’t yet tried restarting only the nginx container instead of the whole container set to see if that helps. Has anybody come across this on their own servers?

My first lead was that this was somehow load-based. We had a user unknowingly hitting the API with massive simultaneous redundant queries that would end up breaking the server, requiring us to restart the KoBo containers. They would pull ~100,000 records before they (or we) knew that there was a 30,000 row limit. Guessing that was a timeout issue, though an error message from the API could be useful.
Then this happened on the instance in Lightsail, which has MAYBE 10 users, none of whom know what an API is, so the excessive load theory was put into question.
I saw a post in git about the uWSGI workers and noticed UNOCHA upped their worker number to 24. I don’t think we can do this as our VM only has 4 cores, but as the default is 2, I decided to up ours to 3 to see what would happen. I think standard practice is 2 workers per core?
Do I really need to set up a load-balancer to address these types of issues?

If anyone has any ideas of how we can avoid the 502/504 errors occurring, that would be very helpful!

OlivierL · August 25, 2020, 7:25pm

Hi @tolexy,

First of all, thank you for your interest in the project.

Every request doesn’t have the same weight on server resources, so it really depends on what users are doing with your setup.

These tasks (among others) can have a significant impact on the rest of the app:

1. Exports (KPI or legacy)
2. Submitting lots of data at once (especially with attachments)
3. Processing complex forms (enketo/redis)

Even though we do our best to optimize the code, some parts of it can be a bottleneck and run slowly under certain circumstances. This can block all other requests and make the whole app unresponsive. BTW, if you identify such a part of the code, or a specific behaviour that slows down the app, do not hesitate to open a PR on the respective repository.

We want to implement submission throttling based on specific rules to help with #2. But we are a small team, so we cannot say when it will be released.

One rule is always true: The more requests you receive, the more resources you need.
UNOCHA’s frontends, for example (m5.xlarge: 4 vCPU and 16 GB RAM), are configured to start with 10 uWSGI workers and allow up to 24 workers. As you can see, we don’t follow the rule of 2 workers per core. Tweaking the number of workers helped a lot with 502 errors.
You can also separate the frontends from the backend to spread the load across servers.
But even with several frontends and big backend servers, we still face 502 errors.

Have you tried to tweak PostgreSQL settings too?

tolexy · August 26, 2020, 1:41pm

This has been super helpful! Admittedly, I had not even opened, let alone tweaked, the uWSGI or Postgres settings until now, but I think these were exactly what I was looking for. If it wasn’t broken, I didn’t go looking for fixes.

1. Bumped uWSGI to start at 10 workers and max out at 16. We are using a VM in Azure resourced equivalently to OCHA’s on AWS. We started getting a lot more requests, so I am really hoping that bumping up these resources helps a bit. I doubt that we are getting used as much as OCHA, at least for now.
2. Doubled the RAM for Postgres (2 -> 4).
3. Changed HDD to SSD to match our VM config. What does this actually do in the settings, and where can I see how it affects them?
4. Planning to separate the frontend and backend in the very near future, to account for load balancing, redundancy, failover, etc.
5. 502 errors: Do ALL the containers need to be restarted? I set up an alert system through one of our other apps that pings service_health for each of our three instances every 15 minutes and sends an email alert if response != 200. Then we go in and restart all the containers. Is there a better way? Does your team have a VM-local script that checks service_health and restarts all or just certain containers?

Thanks for the helpful info!

OlivierL · August 26, 2020, 6:41pm

#1 To be precise, we have set up uWSGI with these settings:
- 2048 requests per worker
- 7 GB RAM max per worker
- 120s timeout

#3 The settings are tweaked with the help of the pgconfig.org API. You can check that they are applied by running select name, setting from pg_settings; in PostgreSQL. Be aware that some settings require PostgreSQL to be restarted.
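
As a rough illustration of that check, here is a minimal Python sketch that compares the rows returned by the pg_settings query against the values you expect. The function name and the example values are hypothetical; the rows are assumed to be (name, setting) pairs fetched with whatever database client you already use. Note that pg_settings reports values as text in each setting's base unit (shared_buffers, for instance, is normally counted in 8 kB pages), so expected values have to be written in that form.

```python
def find_mismatches(expected, rows):
    """Compare expected settings with rows from pg_settings.

    expected: dict mapping setting name -> expected value
    rows:     iterable of (name, setting) pairs, as returned by
              "select name, setting from pg_settings;"
    Returns a dict: name -> (expected, actual), actual is None if absent.
    """
    # pg_settings returns every value as text, so compare as strings.
    actual = {name: setting for name, setting in rows}
    mismatches = {}
    for name, want in expected.items():
        got = actual.get(name)
        if got != str(want):
            mismatches[name] = (str(want), got)
    return mismatches


# Hypothetical example values, not recommendations.
example_rows = [("max_connections", "100"), ("shared_buffers", "131072")]
example_expected = {"max_connections": 100, "shared_buffers": 131072}
print(find_mismatches(example_expected, example_rows))
```

A setting that still shows its old value after the configuration was changed may be one of those that only takes effect after a PostgreSQL restart.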

#5 Most of the time, the kobocat container is the most overwhelmed and returns 502, but if one container becomes unresponsive, all the others end up behaving the same way in a matter of time.
We do have a cron task that monitors the containers’ health every minute. If they don’t respond after X minutes, we restart all of them.
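
A minimal Python sketch of that kind of check, not the actual cron task described above. It assumes the script runs once per minute from cron, that the health URL and state file path are passed in by you (the ones in the usage note are hypothetical), and that the restart command is the python3 ./kobo-install/run.py call mentioned earlier in this thread, run from the right working directory.

```python
import subprocess
import urllib.error
import urllib.request
from pathlib import Path


def is_healthy(url, timeout=10.0):
    """Return True only if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # 502/504 responses, timeouts, refused connections, malformed URLs.
        return False


def read_count(state_file):
    """Read the number of consecutive failures seen so far (0 if unknown)."""
    try:
        return int(Path(state_file).read_text().strip())
    except (FileNotFoundError, ValueError):
        return 0


def next_failure_count(previous, healthy):
    """Reset the counter on success, otherwise add one failure."""
    return 0 if healthy else previous + 1


def check_once(url, state_file, threshold, restart_cmd):
    """One cron run: probe, update the counter, restart at the threshold."""
    failures = next_failure_count(read_count(state_file), is_healthy(url))
    if failures >= threshold:
        # Assumption: restarting everything, as done manually in this thread.
        subprocess.run(restart_cmd, check=False)
        failures = 0
    Path(state_file).write_text(str(failures))
    return failures
```

A cron entry could then call something like check_once("https://kobo.example.org/service_health/", "/tmp/kobo_health_failures", 5, ["python3", "./kobo-install/run.py"]) every minute, where 5 stands in for the "X minutes" above. Counting consecutive failures in a file avoids restarting on a single slow response.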

tolexy · August 26, 2020, 7:15pm

Thanks for the config info. I will take those into consideration as we scale up. I don’t think we can hit the 7 GB/worker at the moment with only 16 GB of RAM. Will take a look at the pgconfig.org API. All super helpful! Cheers
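
For reference, the arithmetic behind that worry can be sketched in a few lines of Python. This is only a conservative worst-case estimate, assuming every worker could reach its memory cap at the same time and that 4 GB is set aside for Postgres on the same VM (other containers are ignored here); the function names are hypothetical.

```python
def worst_case_worker_memory_gb(max_workers, max_gb_per_worker):
    """Memory needed if every worker hit its cap at the same time."""
    return max_workers * max_gb_per_worker


def per_worker_cap_gb(total_ram_gb, reserved_gb, max_workers):
    """Largest per-worker cap that still fits in RAM in the worst case."""
    if max_workers <= 0:
        raise ValueError("max_workers must be positive")
    usable = total_ram_gb - reserved_gb
    if usable <= 0:
        raise ValueError("nothing left for workers after the reserve")
    return usable / max_workers


# Numbers from this thread: 16 workers max, 7 GB cap, 16 GB VM.
print(worst_case_worker_memory_gb(16, 7))  # 112
# Assumption: 4 GB reserved for Postgres.
print(per_worker_cap_gb(16, 4, 16))        # 0.75
```

In practice workers rarely all peak together, so a cap somewhat above this floor may still be workable, but 7 GB per worker clearly assumes far more headroom than a 16 GB machine running 16 workers has.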