What's the best way to retrieve records through API when survey is over 30,000 records?

FernandoCagua · June 7, 2021, 10:52pm

Thanks @Kal_Lam: We don’t have any problem downloading using the User Interface. The problem arises only when accessing the data through the API.

As for the curl commands:

curl -H "Authorization: Token XXXX" "https://kc.humanitarianresponse.info/api/v1/data/XXXX?format=csv" works but only downloads the first 30,000 records
curl -H "Authorization: Token XXXX" "https://kc.humanitarianresponse.info/api/v1/data/XXXX?format=csv&start=30000" also works and downloads records from 30 to 60k, but as mentioned in the link posted by @stephanealoo there is no assurance that the records in this request will not overlap with those obtained in the first batch. To have that assurance we also need a sort parameter.
curl -H "Authorization: Token XXXX" 'https://kc.humanitarianresponse.info/api/v1/data/XXXX?format=csv&start=30000&sort={"_id":1}' as suggested in by @stephanealoo does not work and I get the bad gateway response shown in the first post.

I’ve taken care of encoding the url myself or letting curl doing it (using -G --data-urlencode) to get a path like api/v1/data/XXXX?format=csv&start=30000&sort=%7B%22_id%22%3A1%7D and it does not work either.

Below the verbose output of curl in case it’s helpful (I’ve edited out my survey id and token)

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 63.34.253.220...
* TCP_NODELAY set
* Connected to kc.humanitarianresponse.info (63.34.253.220) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/cert.pem
  CApath: none
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
} [242 bytes data]
* TLSv1.2 (IN), TLS handshake, Server hello (2):
{ [102 bytes data]
* TLSv1.2 (IN), TLS handshake, Certificate (11):
{ [4916 bytes data]
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
{ [333 bytes data]
* TLSv1.2 (IN), TLS handshake, Server finished (14):
{ [4 bytes data]
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
} [70 bytes data]
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
} [1 bytes data]
* TLSv1.2 (OUT), TLS handshake, Finished (20):
} [16 bytes data]
* TLSv1.2 (IN), TLS change cipher, Change cipher spec (1):
{ [1 bytes data]
* TLSv1.2 (IN), TLS handshake, Finished (20):
{ [16 bytes data]
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server accepted to use h2
* Server certificate:
*  subject: CN=kobo.humanitarianresponse.info
*  start date: Aug 14 00:00:00 2020 GMT
*  expire date: Sep 13 12:00:00 2021 GMT
*  subjectAltName: host "kc.humanitarianresponse.info" matched cert's "kc.humanitarianresponse.info"
*  issuer: C=US; O=Amazon; OU=Server CA 1B; CN=Amazon
*  SSL certificate verify ok.
  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x7fb0b2009600)
> GET /api/v1/data/XXX?format=csv&sort=%7B%22_id%22%3A1%7D&start=30000 HTTP/2
> Host: kc.humanitarianresponse.info
> User-Agent: curl/7.64.1
> Accept: */*
> Authorization: Token XXXX
>
* Connection state changed (MAX_CONCURRENT_STREAMS == 128)!
  0     0    0     0    0     0      0      0 --:--:--  0:02:01 --:--:--     0< HTTP/2 502
< date: Mon, 07 Jun 2021 22:45:48 GMT
< content-type: text/html; charset=utf-8
< content-length: 158
< server: nginx/1.19.10
<
{ [158 bytes data]
100   158  100   158    0     0      1      0  0:02:38  0:02:02  0:00:36    45
* Connection #0 to host kc.humanitarianresponse.info left intact
* Closing connection 0

Thanks!