A Note about Duplicate Submissions via KoboCollect

I just started a survey (3000 sample size) where submissions are done via 2G/3G network in East Africa.
So far after about 1900 submissions, about 83 are duplicates. I mentioned this issue here and Mitch replied with the following statement:

The critical question is whether you defined an instanceId in your form.
http://groups.google.com/group/opendatakit/browse_thread/thread/f77bf8942cdfe182/12e7e33119a83172?lnk=gst&q=instanceId#12e7e33119a83172

If you did not define an instanceId, ODK Aggregate cannot de-duplicate your data. I just noticed that the Opendatakit.org form design pages don’t mention this important aspect of form design. I’ll update them next week.

Mitch

I then posed the following question to Mitch (and I’m hoping the KoboCollect developers can add their views too):

I’m still wondering…if ODK Collect (or KoboCollect in our case) has successfully
submitted a form AND its status has changed from FINISHED to SUBMITTED, then why would it even re-submit said instance – regardless of whether an instanceid field is defined or not? Just curious.

So, are the duplicates our survey is experiencing due to bad network? Because I’m thinking that if a form is already submitted, the application KoboCollect or ODK Collect shouldn’t even try to submit it again.

Looking at the Aggregate web interface on Appspot, I see a clear duplicate with the following details:

  • Mobile ID3 has duplicate data that differs only in the end times. The first submission has an end time of **10:31:16** and the second submission has an end time of **14:07:08**. All the other data captured are the **same** for the two duplicate submissions. That would mean concurrent data collection on a single mobile phone, which is not possible. So there is a lag of roughly 4 hours between the two submissions, yet they are still duplicates.
    

So, is this duplicate issue due to a bad network? Eventually, the forms are successfully
submitted! Or is it an issue with the data collector, perhaps not using KoboCollect correctly? We did provide training a few days ago.

Finally, has anyone used this solution to avoid duplicate entries in the future as suggested by Mitch?

For ODK Aggregate, you don’t need to specify the namespace, just have a group in your form.

With ODK Collect 1.1.7 and later, the bind for this element that replicates the instanceID that would otherwise be generated by Aggregate would be:

You can construct your own instanceID expressions. However, you should avoid symbols and punctuation other than colons and dashes since the parsing logic within Aggregate is likely fragile if you go wild with punctuation (and that is used later on when retrieving images, repeat groups, etc.).
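To make the punctuation advice concrete, here is a minimal Python sketch. It assumes an ID of the form `uuid:` followed by a random UUID, and reads "colons and dashes only" as "letters, digits, colons and dashes". The function names are hypothetical and are not part of ODK or KoBo.

```python
import re
import uuid

# Assumption: a "safe" ID contains only letters, digits, colons and dashes.
SAFE_INSTANCE_ID = re.compile(r"[A-Za-z0-9:-]+")


def new_instance_id():
    """Return an ID shaped like 'uuid:<random uuid>' (assumed format)."""
    return "uuid:" + str(uuid.uuid4())


def is_safe_instance_id(value):
    """True if the value uses no punctuation other than colons and dashes."""
    return SAFE_INSTANCE_ID.fullmatch(value) is not None
```

A custom expression that mixes in slashes, spaces or hash signs would fail this check, which is the kind of ID the advice above warns against.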

~DataMax


=============

DataMax,

Interesting problem. It is largely on ODK Aggregate, which we don’t support, so I will leave it to the very able ODK guys to speculate on why Aggregate is allowing duplicate records. I will say that 1.) Duplicate records are not the worst thing that can happen to you. Lost records are the worst thing that can happen: you can always weed out duplicates later, but you can’t fix a lost record. 2.) I would be interested to know if the <instanceID/> solution worked for solving this problem. The Meta Data scheme referred to by Mitch is a good idea.

Your level of duplicates, 83/1900 (about 4.4%), is unpleasantly high. It is high enough to distract you from some other things you should look for: are you losing any records, and are the records that do make it to the server complete and accurate?

My suggestion to you is to make a comparison between your local and remote records. Get all your phones together and use KoBoSync to collect and aggregate your records into a CSV file. KoBoSync is really very reliable, so you can compare your local data set to the remote data set with some confidence.
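As a rough sketch of that comparison, the Python below takes two CSV exports as text and reports records that never reached the server and records the server holds more copies of than the phones do. The column names (`device`, `start`) are hypothetical; use whichever columns identify a record in your own exports.

```python
import csv
import io
from collections import Counter


def load_keys(csv_text, key_fields):
    """Count how often each key (a tuple of the chosen columns) appears."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(tuple(row[f] for f in key_fields) for row in reader)


def compare(local_csv, remote_csv, key_fields):
    """Compare the phone-side export with the server-side export."""
    local = load_keys(local_csv, key_fields)
    remote = load_keys(remote_csv, key_fields)
    return {
        # On a phone but never received by the server: a lost record.
        "missing_on_server": sorted(k for k in local if k not in remote),
        # On the server but on no phone: worth investigating.
        "only_on_server": sorted(k for k in remote if k not in local),
        # Server has more copies than the phones: duplicates.
        "extra_copies_on_server": {
            k: remote[k] - local[k]
            for k in remote
            if k in local and remote[k] > local[k]
        },
    }
```

Calling `compare(local_text, remote_text, ["device", "start"])` separates the two questions raised above: lost records show up under `missing_on_server`, duplicates under `extra_copies_on_server`.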

Are your problems due to a bad network? Probably. It seems you have equipped all your devices with internet-enabled SIMs for data access, so if you want to try another network-based way of synchronizing your data, you can try something that we do. Create a Dropbox account for your data to sync to. Install DropSync on your phones and set them to sync the /sdcard/odk/instances/ folder. DropSync is very good at synchronizing files over non-optimal networks, so it is a good way to collect your files whenever you have some network access. Then, from your Dropbox, you can create your CSV of aggregated data, or push it all to the Aggregate server over a more reliable connection.

Finally, you need to look at your original data to determine if you are doing something weird on the phone that would cause this:

“Mobile ID3 has duplicate data with the difference only in the end times. The first submission has an end time of 10:31:16 and the second submission has an end time of 14:07:08. All the other data captured are the same for the two duplicate submissions. This is a concurrent data collection using a single mobile phone which is not possible. Hence a 4 hour submission lag yet still a duplicate.”

Look at the device that created this and find out if there is only 1 original record, or if there are 2. If only 1, is its end time 10:31:16 or 14:07:08?
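One way to find every record of this kind in an export, rather than spotting them by eye, is to group records that are identical once the end time is ignored. This is only a sketch: it assumes each record is a dictionary, and the field name `end` is hypothetical.

```python
from collections import defaultdict


def group_ignoring(records, ignored=("end",)):
    """Return groups of records that match on every field except the ignored ones."""
    groups = defaultdict(list)
    for rec in records:
        # Build a hashable key from all fields that are not ignored.
        key = tuple(sorted((k, v) for k, v in rec.items() if k not in ignored))
        groups[key].append(rec)
    # Keep only the groups that hold more than one record.
    return [group for group in groups.values() if len(group) > 1]
```

Each returned group is a set of suspected duplicates, and the differing end times within a group show how far apart the submissions were.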

OK, DataMax, those are my thoughts. This is an interesting problem; it is the kind of thing that sometimes pops up in large deployments like yours. If a problem happens one tenth of one percent of the time, you need to collect about 1000 surveys to expect to see it occur once. I’d like to get a good definition of the problem and a good solution that we can enter into the KoBo User Guide. So, please, share with us how it goes.
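The arithmetic behind that rule of thumb can be sketched in a few lines of Python, under the simplifying assumption that each survey hits the problem independently at the same rate. The function names are hypothetical.

```python
def expected_occurrences(rate, n_surveys):
    """Expected number of times the problem shows up in n_surveys."""
    return rate * n_surveys


def prob_at_least_one(rate, n_surveys):
    """Chance of seeing the problem at least once, assuming independence."""
    return 1 - (1 - rate) ** n_surveys


def duplicate_rate(duplicates, submissions):
    """Observed share of submissions that are duplicates."""
    return duplicates / submissions
```

With a rate of 0.001 and 1000 surveys the expected count is 1, and under the independence assumption the chance of seeing at least one occurrence is roughly 63%. By comparison, `duplicate_rate(83, 1900)` is about 0.044, far above that kind of rare-event rate.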

~Neil Hendrick

KoBo Developer


On Mon, Jun 4, 2012 at 4:14 AM, DataMax maxth...@gmail.com wrote:

instanceId

Hello, I think I have a solution to what you are encountering. I would like to get in touch. I have been using Kobo Collect for the last 9 months and have so far rolled out 14 surveys with it in Kenya. Maybe I can help you sort it out.

Stephane Aloo


Hi Stephanie!

Did you find a solution to this issue? I am experiencing the same issues. Same start time and data but just different end times in duplicate entries.

I am also trying to remove the duplicates. I have defined an instance_id, but I am not sure how to then remove them. I am using KoboToolbox with a direct API connection to Google Data Studio for analysing the data. I would prefer not to have to put something in the middle of these two, though ideally the duplicate remover would sit somewhere in between.
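For what it is worth, the removal step itself is small once both copies of a record carry the same ID. The Python sketch below keeps the first record seen for each ID. It assumes the records are dictionaries and that the field is called `instance_id`, which is a hypothetical name, and it will not catch duplicates that were assigned different IDs.

```python
def drop_duplicates(rows, key="instance_id"):
    """Keep the first row seen for each value of the key field."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] in seen:
            continue  # a later copy of a record we already kept
        seen.add(row[key])
        unique.append(row)
    return unique
```

Keeping the first copy rather than the last is a design choice: with the duplicates described in this thread, that retains the record with the earlier end time.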

:slight_smile: Harry


On Sunday, 3 February 2013 09:07:24 UTC, Aloo Stephen wrote:

Hello, I think I have a solution to what you are encountering. I would like to get in touch. I have been using Kobo Collect for the last 9 months and have so far rolled out 14 surveys with it in Kenya. Maybe I can help you sort it out.

Stephane Aloo
