I would appreciate feedback related to the following:
- Are there current solutions that I have missed in my research?
- Am I missing any big challenges?
- Are my concept and approach reasonable?
Background
I am working with an organization that makes extensive use of a self-hosted KoBoToolbox instance. Overall we are very happy with the functionality. However, we have run into problems when connecting analytics tools: for large numbers of submissions we hit API row limits and Power BI performance issues, and the format of the data coming out of the API limits which analytics tools are easy to hook up.
We are looking for ways to make our data more accessible to a variety of analytics tools.
Goals
Top Goal: Get all submission data out of KoBoToolbox into a format that is easy to work with for data analysis.
- Data analysts may use a variety of tools: Excel, Power BI, Tableau, Superset, Metabase, etc.
- Hitting the API repeatedly/frequently for large datasets can be problematic.
Requirements
- Nested data should be flattened
- We should keep useful data types: Date, Datetime, Time, Decimal, etc.
Out of current scope
- media files (photos, videos, audio recordings)
- permissions of data in warehouse
Past Community Discussion
I have looked through the community forum and it appears as though this idea was discussed a bit in the past.
REST Services vs API
REST Services
Not a good option because:
- Must be set up individually for every project/survey
- Only trigger on new submissions
- Do not trigger on:
- New project/survey creation
- Edit to project/survey
- Edit to submission
API v2
Appears to be the best option.
Project/Surveys - can get all projects/surveys and all past versions
- includes the data structure of the project/survey
Submission Data - can get all current submission data for a project/survey; it does not include the history of edits. The relevant fields (see the sketch after this list):
- _id: unique ID for the submission that does not change across edits
- _submission_time: time of the first submission; it does not update on edit
- __version__: the project/survey version the submission was made against
- meta/deprecatedID: set for edited submissions
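For reference, here is a minimal sketch of what those fields look like when pulled from the v2 data endpoint (Python with requests; the server URL, token, and asset UID are placeholders, and the endpoint path is the one documented for API v2):

```python
# Minimal sketch: list submissions for one project/survey via API v2.
# KOBO_URL, API_TOKEN and ASSET_UID are placeholders for your own values.
import requests

KOBO_URL = "https://kf.example.org"            # your KPI server
API_TOKEN = "xxxxxxxxxxxxxxxx"                 # account API token
ASSET_UID = "aXXXXXXXXXXXXXXXXXXXXXX"          # project/survey UID

resp = requests.get(
    f"{KOBO_URL}/api/v2/assets/{ASSET_UID}/data/",
    headers={"Authorization": f"Token {API_TOKEN}"},
    params={"format": "json"},
)
resp.raise_for_status()

for sub in resp.json()["results"]:
    print(
        sub["_id"],                    # stable across edits
        sub["_submission_time"],       # time of the first submission
        sub["__version__"],            # survey version the submission was made against
        sub.get("meta/deprecatedID"),  # present only on edited submissions
    )
```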
Challenges
Changes to project/survey
A project/survey can be changed and redeployed. This may change the structure of the data.
Options:
- Store data for each version in a different table
  - create an additional view or table in Postgres to combine the data
- Add new columns to the existing table when the structure changes (see the sketch after this list)
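For the "add new columns" option, a minimal sketch of what the migration could look like (assuming psycopg2; the table name, column names, and connection string are hypothetical):

```python
# Sketch of the "add new columns when the structure changes" option.
# Table name, column names and connection string are hypothetical.
import psycopg2

new_columns = {"last_name": "text", "number_of_dependents": "integer"}

with psycopg2.connect("dbname=warehouse") as conn:
    with conn.cursor() as cur:
        for name, pg_type in new_columns.items():
            # IF NOT EXISTS keeps the migration idempotent (Postgres 9.6+)
            cur.execute(
                f'ALTER TABLE data_warehouse_submission '
                f'ADD COLUMN IF NOT EXISTS "{name}" {pg_type}'
            )
```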
Edits to data
A submission can be edited. This may change the project/survey version associated with the submission.
Options:
- update the row with the new data (no history of previous data)
- create a new row and use a latest column to identify the most recent row
Mapping data types
Data types can be important during analysis and should be mapped appropriately. This may be tedious, but hopefully not a major challenge.
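As a starting point, a possible mapping from common XLSForm question types to Postgres column types might look like the following (an assumption on my part, not an exhaustive or official list):

```python
# Possible XLSForm type -> Postgres type mapping; a starting point, not exhaustive.
XLSFORM_TO_POSTGRES = {
    "text": "text",
    "integer": "integer",
    "decimal": "numeric",
    "date": "date",
    "time": "time",
    "datetime": "timestamptz",
    "select_one": "text",
    "select_multiple": "boolean",  # one boolean column per choice (see Flattening Data)
    "geopoint": "text",            # could also be split into lat/lon/altitude/accuracy
    "calculate": "text",           # calculations come back as strings; cast if needed
}
```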
Flattening Data
Data for some question types (for example, select_multiple) should be “flattened” to produce multiple columns. This may be tedious, but hopefully not a major challenge.
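A minimal sketch of the flattening, assuming select_multiple answers arrive as a space-separated string of choice names in the JSON export (the function and column names are just for illustration):

```python
# Sketch: flatten a select_multiple answer into one boolean column per choice.
# Assumes the JSON export returns selected choices as a space-separated string.
def flatten_select_multiple(question, answer, choices):
    selected = set((answer or "").split())
    return {f"{question}_{choice}": choice in selected for choice in choices}

# Example using the provinces question from the view in Step 3:
flatten_select_multiple(
    "visited_province", "alberta bc", ["alberta", "bc", "manitoba", "pei"]
)
# -> {"visited_province_alberta": True, "visited_province_bc": True,
#     "visited_province_manitoba": False, "visited_province_pei": False}
```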
Concept and Approach
On a scheduled basis, pull data via the API and load it into a Postgres database that holds flattened, typed data.
Process for pulling in data from the API
Step 1
Get all projects/surveys and all past versions via the API.
- For each survey version (see the sketch after this list):
  - create a new database table based on the survey structure
    - use the type field to map to the appropriate DB field(s)
    - flatten into multiple columns:
      - select_multiple: create a new column for each option/choice
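A rough sketch of this for the current version of one survey (the endpoint path, content/survey field names, and table naming are assumptions to verify against your own server; past versions could be handled the same way via the asset's versions endpoint):

```python
# Sketch of Step 1: build a CREATE TABLE statement from a survey's structure.
# Endpoint path and JSON field names are assumptions to verify on your server.
import requests

KOBO_URL = "https://kf.example.org"
API_TOKEN = "xxxxxxxxxxxxxxxx"
ASSET_UID = "aXXXXXXXXXXXXXXXXXXXXXX"

TYPE_MAP = {"text": "text", "integer": "integer", "decimal": "numeric",
            "date": "date", "datetime": "timestamptz", "select_one": "text"}

asset = requests.get(
    f"{KOBO_URL}/api/v2/assets/{ASSET_UID}/",
    headers={"Authorization": f"Token {API_TOKEN}"},
    params={"format": "json"},
).json()

columns = [
    "auto_id bigserial primary key",
    "submission_id bigint",        # _id from the API
    "latest boolean default true",
]
for q in asset["content"]["survey"]:
    name, qtype = q.get("name"), q.get("type")
    if not name or not qtype or qtype in ("begin_group", "end_group", "note"):
        continue
    if qtype.startswith("select_multiple"):
        # one boolean column per choice would be added here (see Flattening Data)
        continue
    columns.append(f'"{name}" {TYPE_MAP.get(qtype, "text")}')

table = f"submissions_{ASSET_UID}_{asset['version_id']}"
print(f'CREATE TABLE IF NOT EXISTS "{table}" ({", ".join(columns)});')
```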
Step 2
For each project/survey (a paging and upsert sketch follows this list):
- get all new submissions since the last pull (via _submission_time)
  - may also filter out submissions from the last X minutes, if needed, to avoid the issue of a new survey version/table not yet being handled by Step 1
  - may need to deal with pagination if over 30K new records
- add each submission to the table matching its version
  - with the latest column set to true
- get all edited submissions via meta/deprecatedID
  - may need to deal with pagination if over 30K edited records
- for each edited submission:
  - _id is the ID used for all instances of this submission
  - assert: an instance of the submission with this _id should already be in the DB
  - if the edited submission's uuid == the uuid of the latest instance in the DB:
    - we already have this edit; do nothing
  - else:
    - for all tables associated with the survey:
      - set latest to false for any rows with this _id
    - add the edited submission to the table matching its survey version
      - with the latest column set to true
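A rough sketch of the new-submissions part of this step, assuming the data endpoint's query/limit/start parameters behave as documented (the deprecatedID pass would follow the same paging pattern, and a full version would clear the latest flag across every table associated with the survey):

```python
# Sketch of Step 2 (new submissions only): page through records newer than the
# last pull and upsert them while maintaining the "latest" flag.
# Endpoint parameters, table naming and connection string are assumptions.
import json
import requests
import psycopg2

KOBO_URL = "https://kf.example.org"
API_TOKEN = "xxxxxxxxxxxxxxxx"
ASSET_UID = "aXXXXXXXXXXXXXXXXXXXXXX"
LAST_PULL = "2024-01-01T00:00:00"   # stored from the previous run
PAGE_SIZE = 5000                    # stay well under the 30K limit

def fetch_pages():
    start = 0
    while True:
        resp = requests.get(
            f"{KOBO_URL}/api/v2/assets/{ASSET_UID}/data/",
            headers={"Authorization": f"Token {API_TOKEN}"},
            params={
                "format": "json",
                "limit": PAGE_SIZE,
                "start": start,
                "query": json.dumps({"_submission_time": {"$gte": LAST_PULL}}),
            },
        )
        resp.raise_for_status()
        results = resp.json()["results"]
        if not results:
            return
        yield from results
        start += PAGE_SIZE

with psycopg2.connect("dbname=warehouse") as conn:
    with conn.cursor() as cur:
        for sub in fetch_pages():
            table = f"submissions_{ASSET_UID}_{sub['__version__']}"
            # A full implementation would clear latest in every table for this
            # survey, not just the table for this version.
            cur.execute(
                f'UPDATE "{table}" SET latest = false WHERE submission_id = %s',
                (sub["_id"],),
            )
            # The flattened, typed columns from Step 1 would be inserted here too.
            cur.execute(
                f'INSERT INTO "{table}" (submission_id, latest) VALUES (%s, true)',
                (sub["_id"],),
            )
```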
Step 3
For each project/survey, create an all_submissions view (a sketch for generating this view from the per-version column lists appears at the end of this section).
- Example: suppose a survey had a v1 and a v2, where v2:
  - added last_name
  - removed date_of_birth
  - added visited_province_pei
  - changed number_of_children to number_of_dependents
- The resulting view:
```sql
create view all_submissions as
select
    auto_id, survey_id, submission_id, first_name, null as last_name, date_of_birth,
    number_of_children, null as number_of_dependents, accept_terms, favorite_color,
    visited_province_alberta, visited_province_bc, visited_province_manitoba,
    null as visited_province_pei
from data_warehouse_submissionv1 where latest = true
UNION
select
    auto_id, survey_id, submission_id, first_name, last_name, null as date_of_birth,
    null as number_of_children, number_of_dependents, accept_terms, favorite_color,
    visited_province_alberta, visited_province_bc, visited_province_manitoba,
    visited_province_pei
from data_warehouse_submissionv2 where latest = true;
```
- Each submission table could have multiple rows with the same submission_id (if the submission was edited), but there should only be one latest row for a given submission across all tables.
- Would not attempt to detect column renames. Best to leave them as two separate columns and let the analyst choose to merge if desired.
- Would have to compare both name and type: if the name is the same but the type has changed, a prefix or suffix would be needed to create separate columns.
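Because the column lists differ per version, the view itself could be generated instead of written by hand. A rough sketch, using hypothetical table and column names that match the example above:

```python
# Sketch of Step 3: generate the all_submissions view from per-version column
# lists instead of writing it by hand. Names are hypothetical, matching the
# example above.
version_columns = {
    "data_warehouse_submissionv1": [
        "first_name", "date_of_birth", "number_of_children", "accept_terms",
        "favorite_color", "visited_province_alberta", "visited_province_bc",
        "visited_province_manitoba",
    ],
    "data_warehouse_submissionv2": [
        "first_name", "last_name", "number_of_dependents", "accept_terms",
        "favorite_color", "visited_province_alberta", "visited_province_bc",
        "visited_province_manitoba", "visited_province_pei",
    ],
}

all_columns = sorted({col for cols in version_columns.values() for col in cols})
selects = []
for table, cols in version_columns.items():
    col_list = ", ".join(c if c in cols else f"null as {c}" for c in all_columns)
    selects.append(
        f"select auto_id, survey_id, submission_id, {col_list} "
        f"from {table} where latest = true"
    )

view_sql = "create view all_submissions as\n" + "\nUNION\n".join(selects) + ";"
print(view_sql)
```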