What's the best way to retrieve records through API when survey is over 30,000 records?

FernandoCagua · June 7, 2021, 3:09am

Hi all, I’m trying to retrieve csv data from the API but I’m failing to do that when the survey has more than 30,000 submissions. I found that, in theory, it should be possible to use MongoDB filters to sort and paginate the queries. However I’m having trouble getting it to work. Has anyone faced a similar challenge? Is there an alternative that can be used to get the same result?

For example when calling: https://kc.humanitarianresponse.info/api/v1/data/XXXXX?format=csv&start=30000&sort=%7B%22_id%22%3A1%7D (which should be the url encoding of sorting by submission id, {"_id": 1}) I get a bad gateway response (see below) after about a 30 seconds wait.

<html>
    <head><title>502 Bad Gateway</title></head>
    <body>
    <center><h1>502 Bad Gateway</h1></center>
    <hr><center>nginx/1.19.10</center>
    </body>
</html>

Removing the sort parameter for the query (https://kc.humanitarianresponse.info/api/v1/data/XXXXX?format=csv&start=30000) is nevertheless successful. However, my understanding is that the sort parameter is important when using the start parameter.

Any pointer about what I might be doing wrong?

Not sure if it matters, but I’m using htmltools in R for the query but I get the exact same response if I use curl directly.

Thanks!

Kal_Lam · June 7, 2021, 3:25am

@FernandoCagua, we have had similar issues reported previously:

But we never heard back if 30k data could be downloaded normally using the UI i.e. using the DATA>Download for the KPI. Maybe you could help us update this through the post.

Besides, you also mentioned that the cURL command is also not working. Does it work if the data size is less than 30k? The community would also like to help you by testing the command line you have used to see if there are any issues there.

stephanealoo · June 7, 2021, 6:49pm

Hi,
Would this work for you

Regards,
Stephane

FernandoCagua · June 7, 2021, 10:52pm

Thanks @Kal_Lam: We don’t have any problem downloading using the User Interface. The problem arises only when accessing the data through the API.

As for the curl commands:

curl -H "Authorization: Token XXXX" "https://kc.humanitarianresponse.info/api/v1/data/XXXX?format=csv" works but only downloads the first 30,000 records
curl -H "Authorization: Token XXXX" "https://kc.humanitarianresponse.info/api/v1/data/XXXX?format=csv&start=30000" also works and downloads records from 30 to 60k, but as mentioned in the link posted by @stephanealoo there is no assurance that the records in this request will not overlap with those obtained in the first batch. To have that assurance we also need a sort parameter.
curl -H "Authorization: Token XXXX" 'https://kc.humanitarianresponse.info/api/v1/data/XXXX?format=csv&start=30000&sort={"_id":1}' as suggested in by @stephanealoo does not work and I get the bad gateway response shown in the first post.

I’ve taken care of encoding the url myself or letting curl doing it (using -G --data-urlencode) to get a path like api/v1/data/XXXX?format=csv&start=30000&sort=%7B%22_id%22%3A1%7D and it does not work either.

Below the verbose output of curl in case it’s helpful (I’ve edited out my survey id and token)

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 63.34.253.220...
* TCP_NODELAY set
* Connected to kc.humanitarianresponse.info (63.34.253.220) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/cert.pem
  CApath: none
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
} [242 bytes data]
* TLSv1.2 (IN), TLS handshake, Server hello (2):
{ [102 bytes data]
* TLSv1.2 (IN), TLS handshake, Certificate (11):
{ [4916 bytes data]
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
{ [333 bytes data]
* TLSv1.2 (IN), TLS handshake, Server finished (14):
{ [4 bytes data]
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
} [70 bytes data]
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
} [1 bytes data]
* TLSv1.2 (OUT), TLS handshake, Finished (20):
} [16 bytes data]
* TLSv1.2 (IN), TLS change cipher, Change cipher spec (1):
{ [1 bytes data]
* TLSv1.2 (IN), TLS handshake, Finished (20):
{ [16 bytes data]
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server accepted to use h2
* Server certificate:
*  subject: CN=kobo.humanitarianresponse.info
*  start date: Aug 14 00:00:00 2020 GMT
*  expire date: Sep 13 12:00:00 2021 GMT
*  subjectAltName: host "kc.humanitarianresponse.info" matched cert's "kc.humanitarianresponse.info"
*  issuer: C=US; O=Amazon; OU=Server CA 1B; CN=Amazon
*  SSL certificate verify ok.
  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x7fb0b2009600)
> GET /api/v1/data/XXX?format=csv&sort=%7B%22_id%22%3A1%7D&start=30000 HTTP/2
> Host: kc.humanitarianresponse.info
> User-Agent: curl/7.64.1
> Accept: */*
> Authorization: Token XXXX
>
* Connection state changed (MAX_CONCURRENT_STREAMS == 128)!
  0     0    0     0    0     0      0      0 --:--:--  0:02:01 --:--:--     0< HTTP/2 502
< date: Mon, 07 Jun 2021 22:45:48 GMT
< content-type: text/html; charset=utf-8
< content-length: 158
< server: nginx/1.19.10
<
{ [158 bytes data]
100   158  100   158    0     0      1      0  0:02:38  0:02:02  0:00:36    45
* Connection #0 to host kc.humanitarianresponse.info left intact
* Closing connection 0

Thanks!

FernandoCagua · June 7, 2021, 11:08pm

Thanks @stephanealoo for linking to that resource. You might notice in my original post that I’ve already followed the instructions that @jnm suggested in the post you linked but they don’t seem to be working. Sorry I wasn’t clear enough!

I’ve detailed the curl commands and the responses I’m getting in my reponse to @Kal_Lam. Maybe that’s helpful information?

I’ve tried all sorts of things:

Reordering the query parameters
Requesting json instead of csv data
Using ascending or descending order in the sort parameter
Sorting by other fields like _uuid, date, etc.
Manually encoding the url and letting curl encode it
Tried at different times of the day
Tried from different machines, different OS

But none of that seem to make any difference.

I have only tried this in the humanitarian server. Maybe the sorting query/request takes too long to complete and so it’s being cancelled/killed by the server?

I would appreciate if you have any further insights. Thanks again!

Kal_Lam · June 8, 2021, 8:40am

@FernandoCagua, thank you for the detailed descriptions. Pinging @Josh here for technical support.

FernandoCagua · June 9, 2021, 8:27pm

Thanks so much @Kal_Lam. Looking forward for @Josh input!

Josh · June 9, 2021, 9:48pm

Hi @FernandoCagua, there have been instances where queries on large datasets time out and can be caused by heavy load on the servers — we are in the process of trying to increase the efficiency of those queries. Have you tried querying the api/v2 endpoints for the JSON data? No guarantee that it will solve the issue, but worth testing.

FernandoCagua · June 14, 2021, 4:30am

Thanks @Josh. I tried the api/v2 endpoints. Unfortunately, the sort parameter fails as well. Our survey has about 34k records only so I’m surprised that the servers are getting overworked with our request.

We will probably stick to the api/v1 endpoints as the requests in api/v2 seem to fail way more often.

When we do the request without the sort parameter the requests succeed and it appears that the ids are unique across “pages”. Do you think it is OK to ignore sort parameter and still get unique records?

Josh · June 14, 2021, 6:38pm

Hi @FernandoCagua, when you say that “the sort parameter fails as well”, do you mean that the request times out or that there are other issues? Yes, it is concerning that you are having issues with that amount of records — hopefully the update in the coming day will solve this.

When you say that “requests in api/v2 seem to fail way more often”, can you clarify what issues you tend to have with the v2 endpoints?

Can you please also clarify what you mean by “it appears that the ids are unique across “pages””. The sort parameter should have no impact on uniqueness.

Josh · June 29, 2021, 3:39pm

Hi @FernandoCagua, can you please test this on your side again. We’ve made some performance improvements that should hopefully solve this issue.

soubhi28 · August 8, 2022, 6:12am

Hi @FernandoCagua ,

I have tried it with json format and worked successfully:
https://kc.humanitarianresponse.info/api/v1/data/xxxxxxx?format=json&sort={“_id”:1}&start=30000