KoBoToolbox Data Warehouse

mattlangeman · April 20, 2021, 7:37pm

I would appreciate feedback related to the following:

Are there current solutions that I have missed in my research?
Am I missing any big challenges?
Is my concept and approach reasonable?

Background

I am working with an organization that makes extensive use of a self-hosted version of KoBoToolbox. Overall we are very happy with the functionality. However, we have had some issues when using analytics tools. For large numbers of submissions, we have had issues with API row limits and Power BI performance. The format of data coming out of the API also limits the analytics tools that are easy to hookup.

We are looking for ways to make our data more accessible to a variety of analytics tools.

Goals

Top Goal: Get all submission data out of KoBoToolbox into a format that is easy for data analysis.

Data analysts may use a variety of tools: Excel, PowerBI, Tableau, Superset, Metabase, etc.
Hitting the API repeatedly/frequently for large datasets can be problematic.

Requirements

Nested data should be flattened
We should keep useful data types: Date, Datetime, Time, Decimal, etc.

Out of current scope

media files (photos, videos, audio recordings)
permissions of data in warehouse

Past Community Discussion

I have looked through the community forum and it appears as though this idea was discussed a bit in the past.

REST Services vs API

Rest Services

Not a good option because:

Must be setup individually for every project/survey
Only trigger on new submissions
Do not trigger on:
- New project/survey creation
- Edit to project/survey
- Edit to submission

API v2

Appears to be the best option.

Project/Surveys - can get all projects/surveys and all past versions

includes data structure of projects/survey

Submission Data - can get all current submission data for project/survey. Does not include history of edits

_id unique ID for the submission that does not change for edits
_submission_time (this is the first submission and does not update on edit)
__version__ of project/survey
meta/deprecatedID is set for edited submission

Challenges

Changes to project/survey

A project/survey can be changed and redeployed. This may change the structure of the data.

Options:

Store data for each version in a different table.

create an additional view or table to combine data in Postgres

Add new columns to existing table when structure changes

Edits to data

A submission can be edited. This may change the project/survey version associated with the submission.

Options:

update row with new data (no history of previous data)
create a new row and use latest column to identify the most recent row

Mapping data types

Data type can be important during analysis and should be mapped appropriately. Maybe tedious, but hopefully not a major challenge.

Flattening Data

Data for some question types (example: select_multiple) should be “flattened” to produce multiple columns. Maybe tedious, but hopefully not a major challenge.

Concept and Approach

On a scheduled basis pull data via the API and put it into a Postgres database that has flattened, typed data.

Process for pulling in data from the API

Step 1

Get all project/surveys and all past versions via API

For each survey version
- create a new database table based survey structure
  - use type field to map to appropriate DB field(s)
  - flatten into multiple columns:
    - select_multiple, create new column for each option/choice

###Step 2.

For each project/survey:

get all new submissions since last pull (via _submission_time)
- may also filter submission that happened with last X minutes if need to avoid issue of survey/table not being part of Step 1
- may need to deal with pagiation if over 30K new records
- add submission to table based on version
  - with set latest column to true
get all edited submission via meta/depreciatedID
- may need to deal with pagiation if over 30K edied records
- for each edited submission
  - _id is ID used for all instances of this submission
  - assert: an instance of submission with _id should be in DB
  - if edited submission uuid == uuid of latest instance in DB
    - already have this edit. do nothing
  - else
    - for all tables associated with the survey
      - set lastest to False for an rows of _id
      - add edited submission to table based on survey version
        
        with latest column set to True

Step 3.

For each project/survey create an all_submissions view.

Example: suppose there was a v1 and v2 survey
- v2:
  - added last_name
  - removed date_of_birth
  - added visited_provice_pei
  - changed number_of_children to number_of_dependents

  create view all_submissions as
  
  select 
  auto_id, survey_id, submission_id, first_name, null as last_name, date_of_birth, number_of_children, null as number_of_dependents, accept_terms, favorite_color, 
  visited_province_alberta, visited_province_bc, visited_province_manitoba, null as visited_province_pei
  from data_warehouse_submissionv1 where latest = true
  
  UNION
  
  select 
  auto_id, survey_id, submission_id, first_name, last_name, null as date_of_birth, null as number_of_children, number_of_dependents, accept_terms, favorite_color, 
  visited_province_alberta, visited_province_bc, visited_province_manitoba, visited_province_pei
  from data_warehouse_submissionv2 where latest = true;

Each submission table could have multiple rows with the same submission_id (if submission was edited). But there should only be one latest submission across all tables.
Would not attempt to detect name changes (renaming). Best to leave them as two separate columns and let the analyst choose to merge if desired.
Would have to compare both name and type. If name is the same, but type is changed, would need prefix or suffix to create separate columns.

wroos · April 20, 2021, 8:40pm

Hello,
Just to add for further discussions. We - and probably other users - work rather successfull with download as xls (XML headers and values) and SPSS (or Stata) for Analysis.

Flattening of repeats would not be preferable for us and multiple selects are fine (combined plus seperated in download). already. But data types could be improved.

Josh · April 23, 2021, 5:53pm

Hi @mattlangeman, thank you for sharing your thought process here and please let us know how things work out.

Can you please elaborate on this:

The format of data coming out of the API also limits the analytics tools that are easy to hookup

It will be good to know what can be improved to make these integrations more seamless.

ks_1 · April 25, 2021, 4:40pm

@Josh
I’m not sure what he meant, but what I’ve noticed is that few dashboard tools offer connectivity with APIs. So we’re severely limited in the analytics tools we can use.

Josh · April 26, 2021, 5:23pm

@ks_1 if that is the case, what is the preferred method of connecting external data sources into a dashboard tool?

ks_1 · April 26, 2021, 5:56pm

Here’s the data source list of Grafana, a popular open source analytics tool:

Josh · April 26, 2021, 6:00pm

@ks_1, Grafana does at least offer plugins that allow for JSON API connections:

Although I haven’t tested this with our API.

ks_1 · April 26, 2021, 6:03pm

I think our guys tried this but they ran into some problems. Let me try and find out what the exact issue was.

mattlangeman · April 28, 2021, 4:06pm

@Josh

I know that some users have had issues dealing with the 30K limit of the API and also with Power BI performance when pulling in large datasets from the API.

The excel/CSV exports are great for one-off analysis, but harder to use when trying to automate pulling data into analysis/dashboarding tools.

We have explored using Apache Superset and a few other dashboarding tools. At the time, Superset had some support for CSV uploading, but didn’t have a way to automatically/directly pull data. I do see Superset has an open ticket for MongoDB support but have not investigated closely

Most dashboarding tools have support for Postgres. My thinking was a Postgres database (with each survey in a separate tables), would provide a lot of flexibility.

Josh · May 4, 2021, 4:28pm

Hi @mattlangeman,

Thanks for clarifying some of the difficulties. There are unfortunately also limitations when using the CSV export in this case if you have repeat groups in the form (although hopefully we’ll get around to addressing this soon):

Using a separate table for each survey should work fine though if you are using a self-hosted instance of Kobo, are you not able to connect directly with the postgres database located at kobo-docker/.vols/db?

ks_1 · October 30, 2021, 10:05pm

I was able to connect directly to Postgres with Power BI after exposing the backend container ports. But I’m not able to get DirectQuery to work because of the format of the data. DirectQuery is what makes a dashboard live, otherwise we have to refresh the connections manually.

Has anyone found a way to work around this?

chris · November 21, 2022, 10:41am

Hi @mattlangeman , we have been supported by another Matt (at www.opencitieslab.org) who has written an Apache Airflow tool to place KoBo records into a Postgresql database adjacent to a Metabase instance that we use to query the data. Still early but we have one project moving smoothly. Just need to replicate…