Categorical data from Kobo to R

Hi Community,

I am currently working on survey data pulled from kobo. The survey consists of about 400 variables/questions with a larger proportion of the data comprised of categorical variables. Given that the survey was in two languages, I exported it using the XML headers and value option.

Now that I am working with it in R, do I need to change all the xml values (observations) to the full full labels i.e “not_sure” to “I am not sure” for each variable or is there a less cumbersome way to do this?

You can export SPSS labels files (multiple languages). And there is probably a way to import them (adapted?) into R. See e.g. Working with SPSS labels in R | R-bloggers.

1 Like

Welcome to the community, @Sticks! Pinging @dickoa (who has been very helpful in the community documenting and sharing how to integrate KoboToolbox to r) should be able to guide you with r related issues.

1 Like

Hi @Sticks,

You want to change the label in R (e.g from no_sure to I am not sure) for your analysis and not touch at the Kobo form right ? I’m trying to understand what you want to do.
I’ll also be interested in wow do you load the Kobo data in R ?

1 Like

Yes. To clarify, since the survey was done in two languages, I downloaded a csv file of the survey responses selecting the xml values and headers option.

Given that all responses now reflect everything that would have been under the names column in the xls form, I would like to change all responses back to what would have been in the label column all in one language. So in this light is there an easier way to do this in R or alternatively a much better way for working with kobo data in R

Thanks @Sticks. I can share with you one way to do it using robotoolbox. You can start by going through the introductory vignette and the article on labels. Here is quick example

library(robotoolbox)
library(tidyverse)
library(labelled)

###
url <- "https://kf.kobotoolbox.org"
token <- "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx8ae99"

### 
kobo_setup(url = url,
          token = token)

### uid of the project on kf.kobotoolbox.org
uid <- "aEwTYNcU76UvLgiY89rPWm"

###
data <- kobo_data(uid)
glimpse(data)
## Rows: 95
## Columns: 23
## $ start                <dttm> 2022-03-06 16:56:39, 2022-03-06 16:55:02, 2022-0…
## $ end                  <dttm> 2022-03-07 16:24:54, 2022-03-06 16:56:09, 2022-0…
## $ dima                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ work_loc             <chr+lbl> "country_operation", "country_operation", NA,…
## $ ops                  <chr+lbl> "cod", "cod", NA, NA, NA, "nga", "mli", NA, "…
## $ mco                  <chr+lbl> NA, NA, NA, NA, NA, NA, NA, "cmr-mco", NA, NA…
## $ bureau               <chr+lbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ area                 <chr+lbl> "none_above", "none_above", NA, NA, NA, "comb…
## $ other_area           <chr> "Protection", "Protection", NA, NA, NA, NA, NA, N…
## $ r_knowledge          <chr+lbl> "none", "none", NA, NA, NA, "none", "none", "…
## $ stats_background     <chr+lbl> "yes", "yes", NA, NA, NA, "yes", "yes", NA, N…
## $ wishlist             <chr> NA, "Je souhaite apprendre du logiciel R pour fac…
## $ useful_links         <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ `_id`                <int> 18529936, 18509567, 18509535, 18509523, 18509508,…
## $ uuid                 <chr> "f3636e5be98042b1b5535bf580638505", "f3636e5be980…
## $ `__version__`        <chr> "v7Dqeq9db9wZSs9rfwkFa9", "v7Dqeq9db9wZSs9rfwkFa9…
## $ instanceID           <chr> "uuid:45694127-61f7-43fc-a532-43da1e74018c", "uui…
## $ `_xform_id_string`   <chr> "aEwTYNcU76UvLgiY89rPWm", "aEwTYNcU76UvLgiY89rPWm…
## $ `_uuid`              <chr> "45694127-61f7-43fc-a532-43da1e74018c", "b2939016…
## $ `_status`            <chr> "submitted_via_web", "submitted_via_web", "submit…
## $ `_submission_time`   <dttm> 2022-03-07 16:25:05, 2022-03-06 16:56:18, 2022-0…
## $ `_validation_status` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ `_submitted_by`      <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…

Now that we have the data, we can use one labelled column and changed a label. We can pick the area column.

## checking the variable without labels
count(data, area)
## # A tibble: 6 × 2
##   area                                                       n
##   <chr+lbl>                                              <int>
## 1 combination [A combination of the above]                   5
## 2 im [Information Management]                               24
## 3 none_above [None of the above - other functional unit]    21
## 4 odm [Operational Data Management]                         15
## 5 registration [Identity Management and Registration]       13
## 6 NA                                                        17

## generating labels using to_character from the labelled package
count(data, area_lbl = to_character(area))
## # A tibble: 6 × 2
##   area_lbl                                      n
##   <chr>                                     <int>
## 1 A combination of the above                    5
## 2 Identity Management and Registration         13
## 3 Information Management                       24
## 4 None of the above - other functional unit    21
## 5 Operational Data Management                  15
## 6 NA                                           17

val_labels(data$area)
##                    Information Management 
##                                      "im" 
##      Identity Management and Registration 
##                            "registration" 
##               Operational Data Management 
##                                     "odm" 
##                A combination of the above 
##                             "combination" 
## None of the above - other functional unit 
##                              "none_above" 

Now, let’s change the label of none_above set to None of the above - other functional unit. Let change the label to just None of the above instead of None of the above - other functional unit.

## check the label of the value `none_above`
val_label(data$area, "none_above")
## [1] "None of the above - other functional unit"

## change the label of the value `none_above`
val_label(data$area, "none_above") <- "None of the above"

## check again
val_label(data$area, "none_above")
## [1] "None of the above"

 count(data, area_lbl = to_character(area))
## # A tibble: 6 × 2
##   area_lbl                                 n
##   <chr>                                <int>
## 1 A combination of the above               5
## 2 Identity Management and Registration    13
## 3 Information Management                  24
## 4 None of the above                       21
## 5 Operational Data Management             15
## 6 NA                                      17

robotoolbox relies on the labelled package for value and variable labels manipulation, you can learn more about the labelled package in the intro vignette.

I hope it helps

2 Likes

This is an awesome solution.

Thank you for your support.

1 Like

Hi again.

So I tried out the code and R is now connected to the kobo interface and I can see the asset list comprised of all our projects. I somehow can’t pull the data. See the error message below for the kobo_data(uid) function:

Error in stri_detect_regex(y, paste0(“\b”, uv[iter], “\b”)) :
Incorrectly nested parentheses in regex pattern. (U_REGEX_MISMATCHED_PAREN, context=\biud)\b)

Hi @Sticks ,

What’s the output of

library(robotoolbox)
packageVersion("robotoolbox")
kobo_settings()

and the output

project <- kobo_asset(uid)
project

Can you read other projects data without issues ?

Best,
Ahmadou

1 Like

For the version, I have 1.3.2 installed.

For the settings, It displays my URL and part of my API token.

I also cannot also read in other surveys. I still get the same error. See code below:

library(robotoolbox)
library(tidyverse)
library(labelled)

#Kobo account details
url ← “https://kf.kobotoolbox.org
token ← “xxxxxxxxxxxxxxxxxxxxxxxxxxxxx3ad”
kobo_setup(url = url,
token = token)

uid ← “aBoxkytRejBVvbusYsWNp5”
asset ← kobo_asset(uid)
data ← kobo_data(asset)

Error (for all forms) : Error in stri_detect_regex(y, paste0(“\b”, uv[iter], “\b”)) :
Incorrectly nested parentheses in regex pattern. (U_REGEX_MISMATCHED_PAREN, context=\biud)\b)

Thank you @Sticks for the feedback.
I’ll need to find a way to reproduce it. In the meantime, you can update all the packages (including stringi) just to make sure that the issue is not linked to the fact that we have different versions, particularly for robotoolbox dependencies. Is it possible to have a version of your xlsform (if there’s nothing sensitive on it, of course). The issue is linked to select_multiple questions choices.

1 Like

Thank you @dickoa I appreciate the support thus far.

I have updated all the packages so long.

I unfortunately cant send the xlsform due to the sensitivity of the survey. Could we perhaps find a way around it to help you find out what the issue may be?

Hi @Sticks,
I totally understand about the sensitivity and will see what I can do to find where the culprit is.

The issue is linked to some of the select_multiple choices values, some of them are not properly parsed using the regex to turn the select_multiple column to multiple dummy variables (1/0). It’s the first time to encounter such error, it’ll be great to be able to reproduce and fix it. I started playing with different type of select_multiple choices.

The error is coming from this line

And I have other way to do this regex, but I first need to reproduce it, I’ll continue to dig, and I’ll let you know as soon as I have something similar. If there are select_multiple column with special characters or if they’re using unicode or language non encoded in ASCII let me know.

1 Like

Idea: Parsing problem might be related to special choice names, e.g. with _ (underline), . (dot), blanks or tags or duplicate choice names Maybe you could share the parsing function (stri_detect_regex) here?.

1 Like

That’s a good lead @wroos, I’ll add it to my test and also try with choices in arabic or other non-ascii languages.

2 Likes

Thanks for your work on the tool, @dickoa.
Based on your parsing function try to image what people may “create” as choice names, i.e. as regex input.

1 Like

This a good one!

I do have duplicate choice names, particularly in the names column of the xls choices sheet. Could that be the cause of the error?

I just managed to pull data from a dummy (test) form created without duplicates in the names column.

@dickoa Could this be modified in the source code to allow it to pull data from forms with repeated xml values/choice names? This was a large survey, so at some point it became tedious having to design a new xml value for a choice that was different by just one word in the label column and at the same time too long to put out the full word separating the terms with underscores.

1 Like

That’s interesting, I remember testing against duplicated column names, but I’ll check again.

I’m also following another lead, and it’s related to the presence of parenthesis. Do you have some parenthesis in some choices (“(”, “)”, etc.). ?

Best,
Ahmadou

1 Like

I would recommend to always check the form during development and before deployment with the online validator which will detect duplicate names.
A lot of problems posted to this forum could be avoided this approach.

1 Like

There are no parentheses, but I have an apostrophe " ’ " and integers as choices.

We asked participants to rate the survey out of 10 but rather than setting it up as an integer question type, we chose a select_one question with options 0 all the way up to 10 as choices in the name coulum.

Additionally, some choices have numeric characters in them.
e.g: only_use_5 and the adjacent label would be “I only use five”.