Hi all, I’m trying to retrieve CSV data from the API, but I’m failing to do so when the survey has more than 30,000 submissions. I found that, in theory, it should be possible to use MongoDB filters to sort and paginate the queries. However, I’m having trouble getting it to work. Has anyone faced a similar challenge? Is there an alternative that achieves the same result?
For example, when calling https://kc.humanitarianresponse.info/api/v1/data/XXXXX?format=csv&start=30000&sort=%7B%22_id%22%3A1%7D (which should be the URL encoding of sorting by submission id, {"_id": 1}), I get a bad gateway response (see below) after a roughly 30-second wait.
<html>
<head><title>502 Bad Gateway</title></head>
<body>
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx/1.19.10</center>
</body>
</html>
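As a sanity check, the query string above can be reproduced with Python’s standard library, confirming that sort=%7B%22_id%22%3A1%7D is the correct percent-encoding of {"_id": 1} (the survey id in the URL is a placeholder):

```python
from urllib.parse import urlencode

# Build the query string for sorting by submission id, {"_id": 1}.
params = {"format": "csv", "start": 30000, "sort": '{"_id":1}'}
query = urlencode(params)
print(query)  # format=csv&start=30000&sort=%7B%22_id%22%3A1%7D
```

So the encoding itself matches what I’m sending; the 502 does not seem to be an encoding problem.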
Removing the sort parameter from the query (https://kc.humanitarianresponse.info/api/v1/data/XXXXX?format=csv&start=30000) nevertheless succeeds. However, my understanding is that the sort parameter is important when using the start parameter.
Any pointers on what I might be doing wrong?
Not sure if it matters, but I’m using htmltools in R for the query, and I get the exact same response if I use curl directly.
@FernandoCagua, we have had similar issues reported previously:
But we never heard back whether 30k+ records could be downloaded normally through the UI, i.e. via DATA > Download in the KPI. Maybe you could help us update that through the post.
You also mentioned that the cURL command is not working. Does it work when the data size is under 30k? The community would also like to help by testing the command line you used to see if there are any issues there.
Thanks @Kal_Lam: We don’t have any problem downloading using the User Interface. The problem arises only when accessing the data through the API.
As for the curl commands:
curl -H "Authorization: Token XXXX" "https://kc.humanitarianresponse.info/api/v1/data/XXXX?format=csv" works but only downloads the first 30,000 records
curl -H "Authorization: Token XXXX" "https://kc.humanitarianresponse.info/api/v1/data/XXXX?format=csv&start=30000" also works and downloads records from 30k to 60k, but as mentioned in the link posted by @stephanealoo, there is no assurance that the records in this request will not overlap with those obtained in the first batch. To have that assurance we also need a sort parameter.
curl -H "Authorization: Token XXXX" 'https://kc.humanitarianresponse.info/api/v1/data/XXXX?format=csv&start=30000&sort={"_id":1}' as suggested by @stephanealoo does not work, and I get the bad gateway response shown in the first post.
I’ve taken care of encoding the URL myself, and also of letting curl do it (using -G --data-urlencode), to get a path like api/v1/data/XXXX?format=csv&start=30000&sort=%7B%22_id%22%3A1%7D, and it does not work either.
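For reference, here is a minimal Python sketch of the paginated request I am attempting; the survey id and token are placeholders, and fetch_page is exactly the request that currently returns the 502:

```python
import urllib.request
from urllib.parse import urlencode

BASE = "https://kc.humanitarianresponse.info/api/v1/data/XXXX"  # placeholder survey id
TOKEN = "XXXX"  # placeholder API token

def page_url(start):
    """Build the URL for one page, sorted by _id so that pages cannot overlap."""
    query = urlencode({"format": "csv", "start": start, "sort": '{"_id":1}'})
    return f"{BASE}?{query}"

def fetch_page(start):
    """Fetch one 30k-record page; this is the call that times out with a 502 for me."""
    req = urllib.request.Request(
        page_url(start), headers={"Authorization": f"Token {TOKEN}"}
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return resp.read().decode("utf-8")
```

The URL produced by page_url(30000) is byte-for-byte the one I’m passing to curl.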
Below the verbose output of curl in case it’s helpful (I’ve edited out my survey id and token)
Thanks @stephanealoo for linking to that resource. You might notice in my original post that I’ve already followed the instructions @jnm suggested in the post you linked, but they don’t seem to be working. Sorry I wasn’t clear enough!
I’ve detailed the curl commands and the responses I’m getting in my response to @Kal_Lam. Maybe that’s helpful information?
I’ve tried all sorts of things:
Reordering the query parameters
Requesting json instead of csv data
Using ascending or descending order in the sort parameter
Sorting by other fields like _uuid, date, etc.
Manually encoding the url and letting curl encode it
Tried at different times of the day
Tried from different machines, different OS
But none of that seems to make any difference.
I have only tried this on the humanitarian server. Maybe the sorting query/request takes too long to complete and so it’s being cancelled/killed by the server?
I would appreciate if you have any further insights. Thanks again!
Hi @FernandoCagua, there have been instances where queries on large datasets time out, which can be caused by heavy load on the servers; we are in the process of trying to increase the efficiency of those queries. Have you tried querying the api/v2 endpoints for the JSON data? There is no guarantee that it will solve the issue, but it’s worth testing.
Thanks @Josh. I tried the api/v2 endpoints. Unfortunately, the sort parameter fails there as well. Our survey has only about 34k records, so I’m surprised that the servers are getting overworked by our request.
We will probably stick to the api/v1 endpoints as the requests in api/v2 seem to fail way more often.
When we make the request without the sort parameter, it succeeds, and the ids appear to be unique across “pages”. Do you think it is OK to omit the sort parameter and still get unique records?
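For what it’s worth, this is how I’m checking id uniqueness across pages; the toy data below stands in for the parsed JSON records of two consecutive requests (e.g. start=0 and start=30000):

```python
# Verify that submission ids do not overlap between two downloaded "pages".
# page1 and page2 are placeholders for the records returned by the API.
page1 = [{"_id": 1}, {"_id": 2}, {"_id": 3}]
page2 = [{"_id": 4}, {"_id": 5}]

ids1 = {rec["_id"] for rec in page1}
ids2 = {rec["_id"] for rec in page2}

overlap = ids1 & ids2
print(f"overlapping ids: {sorted(overlap)}")  # an empty list means no duplicates
```

On our real data the overlap comes back empty, which is why I’m asking whether the sort parameter can safely be dropped.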
Hi @FernandoCagua, when you say that “the sort parameter fails as well”, do you mean that the request times out, or that there are other issues? Yes, it is concerning that you are having issues with that number of records; hopefully the upcoming update will solve this.
When you say that “requests in api/v2 seem to fail way more often”, can you clarify what issues you tend to have with the v2 endpoints?
Can you please also clarify what you mean by “it appears that the ids are unique across ‘pages’”? The sort parameter should have no impact on uniqueness.