Self-hosted KoBoToolbox randomly "shutting down" (504 error)

boris · June 3, 2022, 7:58am

Hello,

Just an update on my last message: it appears that after deactivating the cheaper algorithm and restarting KoBoCAT + KPI with ./run.py -cf up -d --force-recreate kpi kobocat nginx, the surveys are not loading anymore.

Then, restarting Kobo with ./run.py makes it freeze at the last step, while waiting for the environment to be ready.

I had to revert the changes with ./run.py --setup (i.e. reactivate the cheaper algorithm) and it works again (until it crashes again in a few days).

Thanks in advance,

Regards,
Boris

yjouanique · June 4, 2022, 4:14pm

Just want to echo that our setup also crashes very frequently with no useful logs to investigate (like ~10 times/day - our traffic is pretty high).

Evidently some uWSGI tuning might be required, but we don’t really have much expertise to track this down, so we haven’t yet tried the tuning parameters advised above.

The impact is very limited since containers are automatically respawned by our infrastructure (we run on kubernetes), and we run 2 nodes anyway so the traffic is instantly rebalanced, so it’s not a huge deal but is still something we’d like to fix at some point.

We’ll give it a try and will report back, but a tuning guide for self-hosters might be a useful thing to do.

OlivierL · June 6, 2022, 3:26pm

There is something wrong because you should see some workers on KoBoCAT too.
Why is the list of your workers empty?

Are your screenshots with cheaper deactivated or activated? (I guess it’s with cheaper on but I want to confirm).

Can you confirm which container needs to be restarted to make the app work again?

boris · June 7, 2022, 2:45pm

Hi,

There is something wrong because you should see some workers on KoBoCAT too.
Why is the list of your workers empty?

@OlivierL, well, now I do see some workers listed in KoBoCAT. Honestly, I have no idea why none were showing up when I took the previous screenshots.

Are your screenshots with cheaper deactivated or activated?

All screenshots were taken with cheaper activated, as deactivating it resulted in the weird behaviour mentioned in my previous message (i.e. surveys not loading, impossible to restart with ./run.py).

Can you confirm which container needs to be restarted to make the app work again?

I’ll need to wait for the next 504 error to confirm that, but I think restarting KPI solves the issue (I’m now restarting regularly with ./run.py -cf restart kpi kobocat nginx, which does the trick).

Thanks also @yjouanique for your message. I agree that having a tuning guide would be great. Your load-balanced infrastructure looks like a great workaround, but in our case we would definitely like to resolve the underlying issue first, and use this kind of architecture to scale Kobo’s usage rather than to prevent random 504 errors.
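For anyone relying on the same restart workaround, here is a minimal sketch of a watchdog that only restarts once several 504s in a row have been seen. It is an illustration, not part of kobo-install: the URL, the install directory, the threshold and the polling interval are all hypothetical, and the only thing taken from this thread is the restart command itself. It masks the problem rather than fixing it.

```python
import subprocess
import time
import urllib.error
import urllib.request

# Restart command quoted in this thread; run from the kobo-install folder.
RESTART_CMD = ["./run.py", "-cf", "restart", "kpi", "kobocat", "nginx"]


def probe(url, timeout=30.0):
    """Return the HTTP status code for url, or None if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # urlopen raises for 4xx/5xx responses; the status is in exc.code.
        return exc.code
    except OSError:
        # Covers URLError, socket timeouts and refused connections.
        return None


def needs_restart(history, threshold=3):
    """True when the last `threshold` probes all returned 504."""
    recent = history[-threshold:]
    return len(recent) == threshold and all(code == 504 for code in recent)


def watch(url, install_dir, interval=60, threshold=3):
    """Poll url forever and restart the front-end containers on repeated 504s."""
    history = []
    while True:
        history.append(probe(url))
        history = history[-threshold:]
        if needs_restart(history, threshold):
            subprocess.run(RESTART_CMD, cwd=install_dir, check=True)
            history = []
        time.sleep(interval)
```

It could be started with something like watch("https://kf.example.org/", "/opt/kobo-install"), where both arguments are placeholders for your own KPI URL and kobo-install path.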

If that can help, Kobo is running on a Debian 11 server behind a Nginx proxy which forwards traffic to the Docker containers.

@OlivierL do you think having Kobo running on its own dedicated server could help? (I mean, if any sort of conflict between Kobo and our proxy or whatever could be the origin of this issue?).
Otherwise, as we are really struggling with the issue, is there any way to “hire” an expert Kobo consultant to help our team with that specific issue?
I assume that if Kobo runs and scales well everywhere except on a few instances such as @yjouanique’s and mine, maybe the issue is related to our server, configuration, etc.?

OlivierL · June 7, 2022, 6:28pm

Hello @boris,
Thanks for the screenshots.

do you think having Kobo running on its own dedicated server could help? (I mean, if any sort of conflict between Kobo and our proxy or whatever could be the origin of this issue?).

One of our servers is running exactly the same setup. One nginx proxy which forwards traffic to many kobo-docker/kobo-install containers. It’s using Ubuntu 20.04 but I’m pretty sure it does not matter. It could be because of the host ulimit but I think Ubuntu and Debian share the same default settings.

Just in case, can you show us what your ulimit settings are?

ulimit -Sa ## Show soft limit ##
ulimit -Ha ## Show hard limit ##
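The same open-files limit can also be read from Python, which may be handy for checking it inside a container, where the limits can differ from the host’s. This is only a sketch using the standard library resource module (Unix only); the helper names are made up for the example.

```python
import resource


def describe_limit(value):
    """Render a limit the way ulimit does, with 'unlimited' for no cap."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def open_file_limits():
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


soft, hard = open_file_limits()
print(f"open files: soft={describe_limit(soft)} hard={describe_limit(hard)}")
```

The soft value corresponds to the "open files (-n)" row of ulimit -Sa and the hard value to the same row of ulimit -Ha.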

How many requests do you receive per minute?

BTW, The app should work with cheaper deactivated. I’ll have a look on my side.

boris · June 8, 2022, 8:29am

Hello @OlivierL,

You’re right that the OS should not change anything as Docker containerises the apps. Here are the current server’s ulimit settings:

ulimit -Sa

real-time non-blocking time  (microseconds, -R) unlimited
core file size              (blocks, -c) 0
data seg size               (kbytes, -d) unlimited
scheduling priority                 (-e) 0
file size                   (blocks, -f) unlimited
pending signals                     (-i) 63235
max locked memory           (kbytes, -l) 2027700
max memory size             (kbytes, -m) unlimited
open files                          (-n) 1024
pipe size                (512 bytes, -p) 8
POSIX message queues         (bytes, -q) 819200
real-time priority                  (-r) 0
stack size                  (kbytes, -s) 8192
cpu time                   (seconds, -t) unlimited
max user processes                  (-u) 63235
virtual memory              (kbytes, -v) unlimited
file locks                          (-x) unlimited

ulimit -Ha

real-time non-blocking time  (microseconds, -R) unlimited
core file size              (blocks, -c) unlimited
data seg size               (kbytes, -d) unlimited
scheduling priority                 (-e) 0
file size                   (blocks, -f) unlimited
pending signals                     (-i) 63235
max locked memory           (kbytes, -l) 2027700
max memory size             (kbytes, -m) unlimited
open files                          (-n) 1048576
pipe size                (512 bytes, -p) 8
POSIX message queues         (bytes, -q) 819200
real-time priority                  (-r) 0
stack size                  (kbytes, -s) unlimited
cpu time                   (seconds, -t) unlimited
max user processes                  (-u) 63235
virtual memory              (kbytes, -v) unlimited
file locks                          (-x) unlimited

How many requests do you receive per minute?

The server usually receives around 80 requests / minute, rising to 200 requests / minute at peak usage.
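For reference, a figure like this can be estimated from the proxy’s access log. The sketch below assumes nginx’s default timestamp format (for example [08/Jun/2022:08:29:00 +0000]); the log path in the usage note is an assumption and may differ on your server.

```python
import re
from collections import Counter

# Captures the timestamp down to the minute, e.g. "08/Jun/2022:08:29".
TIMESTAMP_RE = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}):\d{2} [+-]\d{4}\]")


def requests_per_minute(log_lines):
    """Count access-log lines per minute; lines without a timestamp are skipped."""
    counts = Counter()
    for line in log_lines:
        match = TIMESTAMP_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Calling requests_per_minute(open("/var/log/nginx/access.log")) and then .most_common(5) on the result would list the busiest minutes.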

BTW, The app should work with cheaper deactivated. I’ll have a look on my side.

Thank you!

boris · June 9, 2022, 8:58am

Hi @OlivierL,

Quick update regarding my last message as we just experienced a new 504 error / shutdown:

Restarting only the KPI container makes Kobo work again.
Below are new uwsgitop screenshots taken once the 504 error occurred:

Thank you!

OlivierL · June 9, 2022, 1:35pm

@boris , can you manually upgrade uwsgi inside the KPI container? pip install --upgrade uwsgi.
I see that the version is 2.0.18 whereas KoBoCAT container is 2.0.19.1.

I’m still trying to reproduce your problem on my side, but it has not happened after a few days.

boris · June 9, 2022, 2:02pm

Just updated it on both KPI and KoBoCAT so that they are running under the same version uwsgi-2.0.20.
I will keep you posted if it seems to solve the issue.

OlivierL · June 14, 2022, 2:24pm

@boris, I don’t think it will help. I made some tests on my side and could reproduce the problem even with the latest version of uWSGI.

It seems to be a race condition in Python which makes uWSGI hang.
Somebody opened an issue on GitHub: “uWSGI hangs sometimes” (Issue #3566, kobotoolbox/kpi). They suggest using lazy-apps = True. According to my tests, it seems to work with this option enabled.

Unfortunately, the downside of lazy-apps=True is that none of the resources are shared between the workers (each worker loads the application itself instead of inheriting it from the master process), so with lots of workers memory usage can grow significantly.

We’ll try to figure out if we can find a solution to fix this but as a short term workaround, you can try the lazy-apps=True option.
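If you would rather script the change than edit the file by hand, here is a rough sketch that sets an option inside the [uwsgi] section of an ini file. The function name is hypothetical and it has not been run against the actual KPI or KoBoCAT ini files, so please check the result before restarting anything.

```python
import re


def ensure_uwsgi_option(ini_text, key, value):
    """Return ini_text with `key = value` set inside the [uwsgi] section."""
    lines = ini_text.splitlines()
    key_re = re.compile(r"^\s*" + re.escape(key) + r"\s*=")
    in_uwsgi = False
    header_index = None
    for i, line in enumerate(lines):
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            # Track whether we are inside the [uwsgi] section.
            in_uwsgi = stripped == "[uwsgi]"
            if in_uwsgi and header_index is None:
                header_index = i
            continue
        if in_uwsgi and key_re.match(line):
            # Option already present (commented lines do not match): overwrite it.
            lines[i] = f"{key} = {value}"
            return "\n".join(lines) + "\n"
    if header_index is None:
        # No [uwsgi] section at all: create one at the end of the file.
        lines.append("[uwsgi]")
        header_index = len(lines) - 1
    lines.insert(header_index + 1, f"{key} = {value}")
    return "\n".join(lines) + "\n"
```

For example, ensure_uwsgi_option(text, "lazy-apps", "True") applied to the contents of a copy of uwsgi.ini adds the line right under the [uwsgi] header, and running it a second time leaves the file unchanged.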

boris · June 14, 2022, 3:19pm

Hi @OlivierL,

I am glad to hear that you have been able to reproduce the issue! That’s a great step towards a long-term fix, I guess.

I have added lazy-apps=True in the uwsgi.ini configuration of both KPI and KoBoCat. I will let you know if it seems to solve the issue in a couple of days.

In the meantime, can I ask you if there is a straightforward way to save this configuration parameter for future restarts of Kobo?
For now, I basically added this lazy-apps parameter by opening a bash session in both the KPI and KoBoCAT containers, editing the uwsgi.ini file and finally restarting the containers individually.

Thank you for your help!

OlivierL · June 14, 2022, 3:34pm

That’s a great step towards a long-term fix, I guess.

Indeed.

In the meantime, can I ask you if there is a straightforward way to save this configuration parameter for future restarts of Kobo?

The easiest solution would be to use the new custom feature of kobo-install.

Make a copy of the KPI and KoBoCAT uWSGI ini files (outside of the containers, i.e. on the host drive).
Add a docker-compose.frontend.custom.yml file to your kobo-docker folder with the following content:

version: '3'

services:
  kobocat:
    volumes:
        - path.to.kc.uwsgi.ini:/srv/src/kobocat/docker/kobocat.ini

  kpi:
    volumes:
        - path.to.kpi.uwsgi.ini:/srv/src/kpi/uwsgi.ini

(Please validate the syntax. I’ve just written this down without testing it.)

Then, run the kobo-install setup (./run.py --setup) and choose the advanced options.
At the end, you should be asked:

Do you want to add additional settings to the front-end docker containers?

Choose 1 (Yes).

manu_j · June 16, 2022, 2:02pm

@OlivierL That somebody who opened the GH issue is me.

I had commented on this issue earlier in this thread with the fix (setting lazy-apps=True): see post #4 by manu_j.

Before this change, Kobo used to crash at least once a day; since then, the setup has been rock solid.