Outage of Global instance, May 9, 2026

Dear KoboToolbox community,

We are once again facing an AWS-related outage on kf.kobotoolbox.org. Our team has been working since the outage began at 11:10 UTC and is currently on the phone with AWS engineering staff.

We sincerely apologize for this disruption. We will restore service as quickly as possible and provide more information as it becomes available.

We believe that the Global instance (kf.kobotoolbox.org, kc.kobotoolbox.org, ee.kobotoolbox.org) is once again operating correctly. We are continuing to monitor it and will post more details shortly.

Please respond here if you are still facing trouble.

In my haste to post this while working on the outage, I incorrectly wrote “May 10” in the topic title. I regret the error and have now corrected this.

Thank you Sir

What happened

All the workers running the KoboToolbox application (Kubernetes pods) abruptly stopped being able to make network connections to our database server.

We found that the “security group” feature in AWS (effectively a firewall) was not applying any configuration changes, even though it reported no errors. Kubernetes is a dynamic system, automatically creating and removing resources in response to demand and to faults; this is great for resilience but requires that “security group” network configuration updates be applied frequently. When the underlying Amazon infrastructure stops applying those updates, the system ceases to operate.
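For readers who want to check for this kind of symptom in their own deployments, the following is a minimal sketch of a TCP reachability probe that could be run from a worker toward a database server. It is an illustration only, not the tooling used during this incident; the function name `can_connect` and the default timeout are assumptions made for the example.

```python
import socket


def can_connect(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout_s.

    Hypothetical helper for illustration. A firewall that silently drops
    packets usually shows up as a timeout rather than an immediate refusal.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # OSError covers refused connections, timeouts, and
        # name-resolution failures.
        return False
```

A call such as `can_connect("db.example.internal", 5432)` would test PostgreSQL's default port; the host name here is a placeholder. A probe like this only shows whether the network path is open, not whether the database itself is healthy.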

How we worked around it

By changing the instance type of the database server, we presumably triggered Amazon’s infrastructure to move it to new hardware, or at least away from whatever subsystem was failing. This allowed security groups to be configured properly again, which in turn allowed traffic to pass between the worker pods and the database server.

What AWS said

AWS told us that their previous problem, caused by an overheated data center, is not actually resolved. After we escalated through two engineers and pushed to be notified when the problems would be fully fixed, they said that someone from their backend team would follow up with us with more details at a later time. By that point, we had worked around the problem ourselves and were focused on bringing the service back online.

What we’re doing

After two unacceptable outages in a row, we’re proactively replicating our Global databases to a different part of the AWS infrastructure. We are starting now, but this process may take multiple days due to the large size of the databases. Creating the replica will happen behind the scenes, with no need to interrupt the KoboToolbox service. We hope that AWS does not continue to experience disruptive problems, but if they do, we hope this replication will have already completed so that we can immediately switch to the replica and avoid any extended outage.

Are data at risk?

We always retain a current backup of all data in a completely separate system (Amazon S3). It is kept up to the minute by continuous PostgreSQL WAL archiving. This is a great safeguard but unfortunately is slow to restore from scratch, which is why we are starting a parallel replica that can be used immediately if needed.
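For readers unfamiliar with WAL archiving: PostgreSQL hands each completed write-ahead-log segment to an archive command, and treats a non-zero exit status as a failure to be retried later. The sketch below shows the general shape of such a step in Python. It is not our actual configuration; it copies to a local directory as a stand-in for remote storage, and the function name `archive_segment` is hypothetical.

```python
import shutil
from pathlib import Path


def archive_segment(segment_path: str, archive_dir: str) -> int:
    """Copy one WAL segment into archive_dir.

    Returns 0 on success and 1 on failure, mirroring the exit status an
    archive command would report. Hypothetical helper for illustration;
    a local directory stands in for remote storage.
    """
    source = Path(segment_path)
    target = Path(archive_dir) / source.name
    if target.exists():
        # Never overwrite a segment that has already been archived.
        return 1
    tmp = target.with_name(target.name + ".tmp")
    try:
        # Copy under a temporary name, then rename, so a partially
        # written file is never mistaken for a complete segment.
        shutil.copyfile(source, tmp)
        tmp.rename(target)
    except OSError:
        return 1
    return 0
```

Because every segment is archived as soon as it is complete, the backup stays close to current. Restoring still means fetching a base backup and replaying all later segments, which is why it is slower than switching to a live replica.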

Submissions that cannot be uploaded while the service is down remain safely stored locally on data collection devices. If ever there is a KoboToolbox outage, safeguard these devices until service is restored and they have fully finished uploading their queue of submissions.

Thank you

Thanks as always for your patience and support of the KoboToolbox open-source platform and community.
