We’ve realized that a form created some duplicate submissions. The data was collected using Enketo on tablets, on the researcher server. The duplicates have exactly the same data and metadata, and even the same _id and _uuid (which should be impossible, right?).
Do you know what’s happening? I’m worried that deleting a duplicate will delete both submissions.
Would you mind sharing a screenshot of the issue? It would be very helpful. Could you also let us know whether your survey project has any image questions?
Hi @dianedetoeuf
Thanks for sending the information. We have had a chance to review this with our developers, and we noted that there is a bug that the team will be working on. We apologize for any inconvenience caused. The only option for now is to delete the duplicate; this will not delete the other copy. It would be a good idea to download your data as it is first, in case you are worried about both copies being deleted.
Yes (with the caveat of us developers not having reproduced it)
Yes (same caveat)
There are a few scenarios that people report as “duplicate submissions”:
The first, which describes Diane’s case, is true duplication, where the XML submissions are completely identical. I consider it a bug that KoBoCAT does not reject a submission whose identical XML already exists in another submission belonging to the same project. This problem is best detected by, first, looking for duplicate UUIDs, and then comparing the XML for any submissions that share the same UUID. Note the _id of each suspicious submission and retrieve its XML from the API, e.g. at https://kf.kobotoolbox.org/api/v2/assets/aYourProjectUid/data/12345.xml, where 12345 is the _id of the submission you want to retrieve.
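The first step described above (looking for duplicate UUIDs and noting the _id of each suspicious submission) could be sketched in Python as follows. This is only an illustration, not KoBoToolbox code: it assumes the submissions have already been exported into a list of dictionaries with `_id` and `_uuid` keys, and the helper names `find_shared_uuids` and `xml_url` are hypothetical. The URL pattern is the one quoted above; the sample data is made up.

```python
from collections import defaultdict


def find_shared_uuids(submissions):
    """Return {uuid: [_id, ...]} for every _uuid that appears more than once."""
    ids_by_uuid = defaultdict(list)
    for sub in submissions:
        ids_by_uuid[sub["_uuid"]].append(sub["_id"])
    return {uuid: ids for uuid, ids in ids_by_uuid.items() if len(ids) > 1}


def xml_url(asset_uid, submission_id, base="https://kf.kobotoolbox.org"):
    """Build the per-submission XML URL using the pattern quoted in the post."""
    return f"{base}/api/v2/assets/{asset_uid}/data/{submission_id}.xml"


# Made-up sample records standing in for an export of the project data.
submissions = [
    {"_id": 101, "_uuid": "aaa"},
    {"_id": 102, "_uuid": "bbb"},
    {"_id": 103, "_uuid": "aaa"},
]

suspicious = find_shared_uuids(submissions)
for uuid, ids in suspicious.items():
    for submission_id in ids:
        # These are the XML documents to download and compare by hand.
        print(uuid, xml_url("aYourProjectUid", submission_id))
```

The XML retrieved from each printed URL can then be compared for the submissions that share a UUID.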
A different scenario consists of submissions that share the same UUID but have different XML contents. UUIDs are generated by the client (Collect, Enketo, someone posting XML to the API, etc.), not the KoBo server. Some OpenRosa implementations reject duplicate UUIDs, but we err on the side of never discarding legitimate data—and we do not plan to change this behavior. This situation can be detected in a similar manner to the previous one: look for the duplicate UUIDs, and then compare the XML. If the XML differs, then the problem lies with the client. Obviously, if you see different responses in submissions that share the same UUID, you don’t need to go to the trouble of comparing the XML.
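To tell the two scenarios apart once the XML has been retrieved, a comparison along these lines may help. It is a minimal sketch under stated assumptions: the XML of each submission sharing one UUID is already available as a string, "identical" is taken to mean an exact string match after trimming surrounding whitespace, and the function name `classify_uuid_group` and the sample XML are hypothetical.

```python
def classify_uuid_group(xml_documents):
    """Classify a group of submissions that all share one UUID.

    xml_documents: list of XML strings, one per submission in the group.
    A group mixing identical and differing XML is reported as
    "different XML", since at least one submission differs.
    """
    if len(xml_documents) < 2:
        return "no duplicate"
    distinct = {doc.strip() for doc in xml_documents}
    if len(distinct) == 1:
        # Completely identical XML: the true-duplication case.
        return "identical XML"
    # Same UUID but different contents: the problem lies with the client.
    return "different XML"


# Made-up XML snippets for illustration only.
same = ["<data><q1>yes</q1></data>", "<data><q1>yes</q1></data>\n"]
different = ["<data><q1>yes</q1></data>", "<data><q1>no</q1></data>"]

print(classify_uuid_group(same))
print(classify_uuid_group(different))
```

An exact string comparison is deliberately strict; it will report a difference for XML that differs only in formatting, so a mismatch is worth inspecting by eye before drawing conclusions.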
There’s a lot of KoBoCAT work in the queue ahead of this. Honestly, we probably won’t begin to address it until the first quarter of 2021.
I have also seen this problem before; the only thing to do was to clean out the duplicated submission. Now that it has happened a second time, your answer to Diane’s request will help us fix this problem.
Please note that this is not possible, as _id is the ID assigned by the KoBoToolbox system and is unique within each server (i.e. HHI, OCHA or a self-hosted server). This previously discussed post may also help you understand what _id is:
We had another issue with duplicated uuid, affecting 1% of the data (which is quite high!).
Did someone manage to replicate the issue and try to solve it?
Hello @dianedetoeuf,
Could you provide more details, please:
At which place do you get/see the duplicates? Table view? Export?
Could you try several times, with a fast server/internet connection, to see whether you get the same duplicates?
In your previous example (Sep 2020) the cases in your screenshot had only the same _uuid, but different _id. (Same in the screenshot of @Bernard_26). Is this the same today?
Are all other data of the cases the same, incl. internal data like full submission_time?
What happens if you try to view or edit the duplicates? Did you ever use Briefcase on this data set?
Do the cases have (multiple) media attachments?
Could someone from the KoBo Core team verify for this project that there are really duplicates in the database (or only in the table view)?
As mentioned above by jnm, there is an open GitHub bug report (since July 2018).
There is also another ODK thread on duplicates here:
https://docs.getodk.org/aggregate-data-access/#publishing
“Under certain failure conditions, the downstream service can receive multiple copies of a given submission. This is known, expected, behavior.
Duplicates typically occur if the downstream service is slow to respond or acknowledge a request. It is your responsibility to detect and eliminate these duplicates should they occur (they will always have exactly the same information in all fields).”
Hi all, I’ve got a similar problem as well. As @dianedetoeuf mentioned, ~2% (22 out of 1024) of my submitted forms have duplicated _uuids (11 duplicated uuids for 22 submissions). Their _ids are unique, and the submissions are not duplicated; they are genuine submissions.
I don’t know if it helps but here are the answers to the questions that @wroos asked:
UUID duplicates appear in both the table view and the Excel export.
Can’t reproduce it.
The IDs are different, but the UUIDs are duplicated.
Other than the username of the person, most of the data is different. (The form has 134 questions, some of which are expected to have the same answers; that is not a duplication issue.)
When trying to view the submission, it works as expected, BUT when I try to edit I get an error:
The data server for your form or the Enketo server is down. Please try again later or contact support@kobotoolbox.org. (500)
(This might be due to today’s server issue; I will check again when the server is behaving a little better.)
(By the way, could there be a relation between this issue and the recent can’t-edit issues?)
Forms don’t have any type of media attachment.
More information:
Each duplicated UUID occurs within a single account. See screenshot for details:
(In other words, there is no duplicated UUID between different accounts)
(Usernames and most of the UUIDs are censored for security reasons.)
All of them are submitted via web.
One of the things I noticed is the submission time. In @dianedetoeuf’s, @Bernard_26’s and my cases, the submission times of the duplicated UUIDs are very close to each other.
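For anyone who wants to check this observation on their own data, here is a rough Python sketch that measures how far apart in time the submissions sharing a UUID are. It assumes each exported row is a dictionary with a `_uuid` key and a `_submission_time` key holding an ISO 8601 timestamp string; those key names, the function name `gaps_between_shared_uuids` and the sample rows are assumptions for illustration, so adjust them to match your export.

```python
from datetime import datetime


def gaps_between_shared_uuids(rows):
    """Return {uuid: seconds between earliest and latest submission}
    for every UUID that occurs more than once."""
    times_by_uuid = {}
    for row in rows:
        submitted = datetime.fromisoformat(row["_submission_time"])
        times_by_uuid.setdefault(row["_uuid"], []).append(submitted)
    gaps = {}
    for uuid, times in times_by_uuid.items():
        if len(times) > 1:
            gaps[uuid] = (max(times) - min(times)).total_seconds()
    return gaps


# Made-up rows for illustration only.
rows = [
    {"_uuid": "aaa", "_submission_time": "2021-02-01T10:15:30"},
    {"_uuid": "aaa", "_submission_time": "2021-02-01T10:15:32"},
    {"_uuid": "bbb", "_submission_time": "2021-02-01T11:00:00"},
]

gaps = gaps_between_shared_uuids(rows)
for uuid, seconds in gaps.items():
    print(f"{uuid}: {seconds:.0f} s between first and last submission")
```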
Hello @hakan_cetinkaya,
thanks for the details and research!
Did I understand correctly?
The _uuid duplicates only happened with submissions from the same device and username.
Do these users have different project/server accounts?
Are the duplicates from directly sequential cases/submissions from the same user?
Are all cases from a common time slot?
Are there cases/submissions without duplicates (from another or the same device/user) between a pair of duplicates, with regard to timestamp and _id?
What about the end timestamp (metadata) of the duplicates, also in relation to other cases?
Which server did you use?
Which browser(s) did you use?
Could someone from the Core Team please explain when exactly the _uuid is generated (and how a duplicate might happen)?