What happened
All the workers running the KoboToolbox application (Kubernetes pods) abruptly stopped being able to make network connections to our database server.
We found that the “security group” feature in AWS (effectively a firewall) was not applying any configuration changes, despite not showing any errors. Kubernetes is a dynamic system, automatically creating and removing resources in response to demand and to faults; this is great for resilience but requires that “security group” network configuration updates be applied frequently. When that underlying Amazon infrastructure fails, the system ceases to operate.
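For readers who run their own KoboToolbox deployments, the symptom described above (workers unable to open network connections to the database) can be checked with a simple TCP probe. The following is an illustrative sketch only, not the tooling we used during the incident. The function name `can_connect` is hypothetical, and the host and port in the usage comment are placeholders (5432 is the PostgreSQL default port).

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) opens within `timeout` seconds."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


# Example usage (placeholder host, assumed PostgreSQL default port):
#   can_connect("db.example.internal", 5432)
```

A probe like this only reports whether the connection can be opened. It cannot tell a firewall or security group problem apart from a database server that is simply down.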
How we worked around it
By changing the instance type of the database server, we presumably triggered Amazon’s infrastructure to move it to new hardware, or at least away from whatever subsystem was failing. This allowed security groups to be configured properly again, which in turn allowed traffic to pass between the worker pods and the database server.
What AWS said
AWS told us that their previous problem, caused by an overheated data center, has not actually been resolved. After we escalated through two engineers and pushed to be notified when the problems would be fully fixed, they said that someone from their backend team would follow up with more details at a later time. By that point, we had worked around the problem ourselves and were focused on bringing the service back online.
What we’re doing
After two unacceptable outages in a row, we’re proactively replicating our Global databases to a different part of the AWS infrastructure. We are starting now, but the process may take multiple days because the databases are large. The replica creation itself will happen behind the scenes, with no need to interrupt the KoboToolbox service. We hope AWS does not continue to experience disruptive problems; if it does, we hope the replication will already be complete so that we can switch to the replica immediately and avoid an extended outage.
Are data at risk?
We always retain a current backup of all data in a completely separate system (Amazon S3). It is kept up to the minute through continuous PostgreSQL WAL archiving. This is a great safeguard, but restoring from it from scratch is slow, which is why we are starting a parallel replica that can be used immediately if needed.
Submissions that cannot be uploaded while the service is down remain safely stored locally on data collection devices. If ever there is a KoboToolbox outage, safeguard these devices until service is restored and they have fully finished uploading their queue of submissions.
Thank you
Thanks as always for your patience and support of the KoboToolbox open-source platform and community.