Data Cleaning best practice

I have more of a process question than a Kobo-specific question. I hope that is ok.

I have a survey on-going for the next week. I have been reviewing results as they come in to check that there are no faults in the forms and that questions are getting correctly recorded etc. There are some small issues that have arisen, like ID codes being entered in the wrong order - which is easily rectified - but it takes a lot of time to do manually in the Kobotoolbox site. I’m thinking of waiting to fix these issues once the survey is complete, but I don’t want to delay fixing issues if it’s better to have this online data accurate.

What do you do for cleaning data? Is it better to have the online data accurate (fix it online), or is it better to have the online data in original form, and edit/clean data in an excel / software form?

Thanks for your thoughts.

3 Likes

Hi @david_shields,

I find a lot of data cleaning issues can be taken care of within the form itself, using constraints or validation criteria.

For example - ID codes - is there a specific “format” of the codes? You can set up a “regex” validation that checks if the format is correct. Or you can compare it to a list of “valid” ID codes that you upload with the form itself.

Do you have specific issues that we can help you with? Because this is definitely something that everyone goes through, so would be happy if I could help!

What you could do is simply update the validation criteria, even if you’re mid-survey - and it can go into effect so you can prevent more data cleaning problems.

If you want to talk further, let me know - could always jump on Skype or something.

Janna

3 Likes

Thanks. Yes the problem was that when the survey was prepared we didn’t know the codes or numbers of enumerators. We do now of course. How do we add a validation criteria that restricts to a certain number of letters and numbers? I don’t know how to do that with text or non-questions.

Maybe we can base it o

Hi @david_shields,
Just as a quick example, I have ID codes that look like this:
UR0864B

If I make the question type a “text” type, then my validation criteria or constraint would be this:
regex(., ‘[U][R][0-9][0-9][0-9][0-9][BE]’)
This means that they must start the ID with a “U”. Then they must enter an “R”. Then they enter any four numbers(digits). Then the last letter is EITHER a B or an E.

If you share your format here, I’m sure we can help you discover the regex for it…if you’re not able to figure it out through googling :slight_smile:

Cheers,
Janna

3 Likes

FYI you can simplify this a bit further to:

regex(., ‘^[U][R][0-9]{4}[BE]$’)

{4} indicates repeat 4 times. I also usually explicitly add ‘^’ (start of string) and ‘$’ (end of string), otherwise you can have garbage at the beginning or end of the string, and it will still successfully match (ie pass validation) provided there is any substring that matches!

6 Likes

Thanks a lot Xiphware, you have made it to look very simpler and straight forward. Just a quick quiz, how can we make the regex to accept both upper cases and lower cases of the letters U, R, BE? I will appreciate your prompt feedback.

1 Like

regex(., ‘^[Uu][Rr][0-9]{4}[BbEe]$’)

I highly recommend testing your regex’s with one of the available online tools to make sure they actually do what you want them to, before going to the trouble of updating your forms. Saves a lot of time :slight_smile:

3 Likes

Wow! Great. Thanks buddy

1 Like

Good day!

I would like ask for your help regarding data cleaning. Currently, I am doing a survey but I am already starting cleaning the data while the survey is underway not to get overwhelmed with the bulk data. While checking on the data, I have observed that some information is sometimes not uploaded or if it is uploaded sometimes the data could be found in a different question, e.g. income coming from vegetable production sometimes could be found in construction. And is it fine to edit and clean the data on the kobo server? I want to do the data cleaning on the kobo server because it automatically does the analysis.
I hope you can help me! Thank you so much. God bless.

Hi!
This is good that you are checking on your data to ensure it’s clean right away.
It seems very strange that some of your data is coming in under the wrong columns, so let’s figure it out, as that will be very painful to clean up in the long run!
Is it possible to share either the XLSform version of your form, or a snapshot of the submitted data that could help us troubleshoot what the problem might be?
Hope we can help you!
Janna

1 Like

Hi @Jodilag
Welcome to the community forum.

Definitely data cleaning is an important process in your day to day handling of your data as noted by @janna.

We need to have a closer look at your data to really determine the behaviour. On a quick note, could you send us

  • Screenshots of what you have on your end
  • Indication of how many times this project has been re-deployed
    -Additional information that would point us to the exact behavior on your form.
    -Confirmation that the data is ok when downloaded and is only problematic while on server OR vise versa.

Yes it is fine to edit on server. Could you first provide the information asked above for us to provide you more support.

Stephane

1 Like

Hello!

Good day! Yes, I can share with you the XLSForm version of the data. Do I have to send it here? Thanks a lot.

Jodilag

Hi @Jodilag,
Yes, if you can share the XLSForm here, that would be great.