Anonymization of personal data (deleting the entries of sensitive data before submitting to the server)

wroos · May 2, 2019, 8:58pm

Hi KoBos,
we are working with KoBoCollect and are looking for an option to anonymise personal data before (or after) submission. The goal is to rule out any possibility of identifying a household. As we work with refugee data, this is also an important security requirement.
Examples are: names of the head of household and household members, GPS coordinates of the household. These data are needed during the interview and are even referenced in some question labels, like: What is the age of ${person}?
(The question is partly related to a former topic, where we found out that we cannot set KoBo variables, see Setting values for KoBoCollect internal variables.)
We know about the KoBo option of encrypting the whole data set, but this will still show the personal data in Briefcase and creates disadvantages for online data validation (on the server).
Kind regards

wroos · October 26, 2019, 7:28pm

Dear KoBos,
the question has not been answered yet. We think anonymisation of field data is a general need.

We meanwhile found the following hint on challenges (from another topic):
https://community.kobotoolbox.org/t/questions-on-multiple-kubo-parameters-capacity/3058/2
“… If you … need to remove GPS data after it has been collected, there’s not yet an easy way to edit submissions in bulk and remove a single response. You could export your data and then delete all the submissions completely from our servers. Alternatively, you could edit the submissions individually within KoBo to remove the GPS data, or (advanced and unsupported) use the API to mass-edit submissions.”
At least, an example for the API option could help us.
Kind regards

wroos · October 28, 2019, 12:17pm

Dear @Kal_Lam,
Thanks for your added tag.
I trust you know: Encryption and anonymisation are two different issues. Anonymisation means to delete (or recode) values from fields, which could be used to identify a person, household, location etc., like GPS co-ordinates. This is normally needed before data sets are shared with stakeholders or researchers.

Some examples for anonymisation options:

Delete all identifying values (or even variables) from the data set.
Recode identifying variables to anonymous values, like HHmember 01, … 02 (as name).
Separate (and remove) possibly identifying information from the main data, into a separate data set, which can only be accessed and linked by few people.
Encrypt (only) the variables with sensitive data.
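
A minimal sketch of the first three options, applied to data that has already been exported, could look like the following. It is an illustration only and does not use any KoBoToolbox feature: the field names (`record_id`, `name`, `gps`) and the function `anonymise` are hypothetical and would have to be replaced by the question names of the actual form.

```python
# Hypothetical names of the identifying fields; adjust to your own form.
IDENTIFYING_FIELDS = {"name", "gps"}


def anonymise(records, id_field="record_id"):
    """Split exported records into a shareable set and a restricted lookup set.

    records: list of dicts, one per household member (assumed export layout).
    Returns (public, restricted); both lists are linked only by id_field.
    """
    public, restricted = [], []
    for index, record in enumerate(records, start=1):
        # Delete all identifying values from the shareable record.
        clean = {k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS}
        # Recode the name to a neutral label such as "HHmember 01".
        clean["name"] = f"HHmember {index:02d}"
        public.append(clean)
        # Keep the identifiers in a separate data set for restricted access.
        secret = {k: record[k] for k in IDENTIFYING_FIELDS if k in record}
        secret[id_field] = record[id_field]
        restricted.append(secret)
    return public, restricted
```

The restricted list would be stored separately and shared only with the few people who are allowed to link it back to the main data.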

We are looking for a best way to do this with KoBo.
Kind regards

stephanealoo · October 29, 2019, 3:54pm

Hi,
As you may have realized, it is not possible to encrypt data partially in such a way that the sensitive data is excluded. I would, however, recommend the following non-system-specific actions that can help anonymize data:

On names: document single names only, with no family names, for the purpose of referencing them in the subsequent parts of the questionnaire.
Do you need to collect GPS coordinates? Do you need to analyze them? If not, do not collect them.
Generally we do not use identifiers for analysis, so if you do not need an identifier for analysis, do not use it.
Control your data processing point, i.e. limit this to one person or to persons who have already agreed to subscribe to the strict data protection principles that you may have put in place.

Regards
Stephane

wroos · October 29, 2019, 9:33pm

Dear Stephane,
thanks for your prompt reply. We share your general hints/suggestions.
But at the KoBo community level we are looking more at how (post-enumeration) anonymisation can best be done (system-specific possibilities).
For example:

If you do a listing before sampling and then interviewing, you need to find the sampled household again later. So you normally need names, GPS location, etc.
GPS is needed to check that only households in the sampled area are taken (and to monitor field work).
So, we are looking for efficient options to hide, recode, delete values of certain KoBoCollect variables after enumeration.
Info on best practices with KoBoToolbox would be highly appreciated.
Kind regards
Wolfgang

edmond.wach · November 1, 2019, 3:08pm

Dear Wolfgang,

Many thanks for raising those points. I just wanted to highlight the fact that I fully agree with you and that with the global (and needed !) trend about data protection in the sector the mentioned features (partial encryption, post anonymisation…) are really needed to be able to continue to use Kobo in a responsible way (as you mentioned we can’t indeed avoid collecting names, GPS location…). One of the current bottlenecks is also linked to the absence of possible “organizational account” allowing an admin to know who has access to which dataset (of course one main account could share with all needed individuals the relevant project but this isn’t convenient and practical) as a real team administration feature.

Kind regards, Edmond

stephanealoo · November 4, 2019, 4:14pm

Hi
Just following up on this. Unfortunately there is no way to automatically control this process. I suggest that you create a way to manually delete the information so that you no longer hold it. If I were to do this, I would however use an off-system process, such as creating an external file that can be used to get the details of the respondents using functionality such as pulldata.

Regards
Stephane

wroos · October 5, 2020, 4:03pm

Dear @stephanealoo, dear @Kal_Lam,
Allow us to come back to this discussion, after European data security regulations have been tightened again.

The data security department asked us to NOT store any person-references. So, we are looking for a solution to control this. We followed the approach from Stephane from the beginning, we only use first names (or pseudo-names) in the interview process (and in monitoring) during field work.

As we have a comprehensive Household and Individual questionnaire, we need to reference a person during the interview. Having bigger and heterogenous households, things like “my third oldest child” or the “younger brother of my wife” would become confusing for the (main) respondent and the interviewer and create severe data quality issues if a reference gets mixed up.

Using a first name seems a smarter solution, but then we need to remove this “name” BEFORE (or during) data transmission to the OCHA server. Letting the interviewer do this manually at the very end of each interview, i.e. before transmission (or finalisation), doesn’t seem a practicable solution, as this would require moving back with complex navigation in repeat groups, especially in bigger households.

So, we are urgently looking for a possibility to create an automatic solution to remove the “names” (delete, set to “xxx” …) or find another approach. Is there any option or trick (like using bind and setvalue, or getting a script to intervene before transmission)?

Side-note: We know a way to remove the “names” later before data export from the server for analysis (making a new version of the form without the “name” variable and exporting with it), but this is too late for EU demands.
Encryption & Briefcase is not a solution for us, as sensitive data will even be exported locally.

Maybe also @Xiphware can help?

Thanks in advance and kind regards
Wolfgang

tinok · October 5, 2020, 7:36pm

Hi @wroos, three ideas come to mind:

Deleting the name before submission

You could ask the survey team to delete the names entered at the beginning and add constraints that make it impossible to submit until the names are deleted. I made a quick mockup of that approach here: https://ee.kobotoolbox.org/OY2MK1Dt. You can download the XLSForm file here https://kf.kobotoolbox.org/#/forms/aSMMNPdo3RmYXehMKKkXdQ

Use a device instead of names

Your interviewers could bring a number of objects with them and ask the respondent to assign each to a family member, then refer to that object throughout. E.g. dice = first daughter, pencil = first son, etc.

Register initials instead of names

This could be the first two letters of each name or the first and last letter of each name (e.g. WO or WG instead of Wolfgang). It’s up to your ethics department to decide if that would be acceptable.

wroos · October 5, 2020, 8:50pm

Dear @tinok,
Thanks for your quick reply and the ideas.

Deleting the name before submission

We already discussed this
Not a preferable option, as we have a long questionnaire with several household and individual parts and with repeats and groups to structure it. So, the enumerator would need to move back to each repeat case, move to the name field and set something like “xxx” (as it must be required). Finally navigate to the end and finalise. Maybe even have to navigate back again, if a name-remove was missed. You may imagine: For a 10-person-member household, the navigation alone will be about 50 extra clicks.

Your kf link didn’t work (login screen). Can you provide your XLSForm, please?

Use a device instead of names

Even more confusing and error-risky probably, as an additional association has to be managed mentally by the respondent and the interviewer. And we use ${Name} references in many dynamic labels (and group titles too).

Register initials instead of names

We discussed using only first names (or pseudo-names, like “younger son”).
Nice idea to abbreviate, but this is a systematic recode, so it is not really liked by data protection controllers. Furthermore, the interviewer would need to re-translate the abbreviated ${Name} references in the question labels during the interview.

So far, the best and most practical solution we are looking for is a way to use the first name during the interview and then automatically remove the name references on send (or finalisation).
This is, as far as I know, an often used approach in censuses and international household surveys too.

Still hopeful for a smart solution.
Kind regards
Wolfgang

stephanealoo · October 5, 2020, 9:31pm

Hi @wroos
Just a quick one on the last part of your query and response to Tino.
So far, the best and most practical solution, we are looking for, is a way to use the first name during the interview and then automatically remove the name references on sent (or finalization). This is, as far as I know, an often-used approach in Census and international household surveys also.

In census surveys and international household surveys, the norm has been to use the first names. While most had been done on a paper interview, it was easier to strip this part of the form prior to data entry thus rendering the data unlinked to the original set. As it is, with the current technological shift of electronic data collection, this would require two separate forms working on a relational basis. However, as we know, this is not the design of ODK based forms.

I know that something like CSPro has been able to do this by having localized server behavior within the data collection device. I honestly think this would be such a long shot from a design perspective. I however think that the easiest approach is to make a suggestion for a future where hypothetically:

We introduce a column during the XForm design for identifier_information where the default is blank, hence false, and the assignment of true defines the question/variable as a piece of identifier information.

Changing the way the system handles identifier_information= “true” such that the data is collected but not submitted.


I would not know the design structural approach that would have to be made for the form to consider them as such.

Stephane

Xiphware · October 6, 2020, 12:08am

Another quick (and dirty?) option might be to make the relevance of any name field you wish to blank out depend on a final submit trigger, e.g.

name-to-hide: relevant = ${finished}!='OK'

and having the final question of your form be:

finished: type=trigger, label="Are you finished?"

The names would remain during form filling, but would become non-relevant (and so blanked on submission) as soon as you hit the final trigger question. Oh, and make the trigger mandatory so that you can't submit the form without hitting it.
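
To make the mechanism easier to follow, here is a small model of it in Python. It is only a sketch of the idea (a field whose relevant condition is false is blanked in what gets submitted), not the actual Enketo or KoBoCollect implementation; the function `build_submission` and the field name `name_to_hide` are made up for illustration.

```python
def build_submission(answers, relevant_rules):
    """Return the values that would be submitted.

    answers: dict of question name -> value entered so far.
    relevant_rules: dict of question name -> function(answers) -> bool.
    Fields without a rule are always relevant; non-relevant fields are blanked.
    """
    submission = {}
    for field, value in answers.items():
        rule = relevant_rules.get(field)
        if rule is None or rule(answers):
            submission[field] = value
        else:
            submission[field] = ""  # blanked because it is no longer relevant
    return submission


# Mirrors: name-to-hide: relevant = ${finished} != 'OK'
RULES = {"name_to_hide": lambda answers: answers.get("finished") != "OK"}
```

Before the trigger is hit, `finished` is empty and the name is kept; once `finished` is `'OK'`, the name comes out blank while all other answers are untouched.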

wroos · October 6, 2020, 6:37am

Thanks @Xiphware,
Great workaround.

Update:
We do NOT need to add required = ${finished} != 'OK' to name-to-hide. This is controlled by the relevant too.
So, name-to-hide field is just required = true (as we use it in dynamic question labels).

Is there any possibility to use a finalise or send system event/status for your relevant? This would make the solution more automatic.
Kind regards
Wolfgang

Kal_Lam · October 6, 2020, 8:06am

This is for those who have been trying to follow what @Xiphware has tried to outline in this post. You could do the same as outlined in the image shared below:

In the survey tab of your xlsform:

Data entry screen (1) as seen in Enketo:

As soon as you press OK, you should see the following screen…

Data entry screen (2) as seen in Enketo:

Here, name 2 is automatically marked for deletion. You need not delete the unwanted variables manually. The system deletes them for you.

Data as seen in the KoBoToolbox server:

Reference xlsform:

name-to-hide.xlsx (8.7 KB)

Thank you @Xiphware for this wonderful workaround!

Xiphware · October 6, 2020, 8:30am

Thank you @Kal_Lam for going to the trouble of putting it all in an actual form which others can leverage

wroos · October 6, 2020, 10:51am

Thanks to ALL of you!
Still open for discussion:

Is there any possibility to use a finalise or send system event/status for your relevant? This would make the solution more automatic.
(@Xiphware, @tinok, @Kal_Lam, please?)

I also did a small update in the last post.

A small disadvantage of the workaround, esp. for KoBoCollect (style pages), is:
Any required field (here the finished variable) prevents the user from moving with the normal flow (Next button or swipe) to the end, i.e. the save point (for example to save a partly done interview). The user needs to use direct navigation (menu) or close the form to be able to save.

Kind regards!

Sjlver · June 30, 2021, 2:16pm

Is there any possibility to use a finalise or send system event/status for your relevant? This would make the solution more automatic.

@wroos Just brainstorming: You might have some other required question that you could (ab)use for this. As long as it comes late in the form, after personal information is no longer needed, that should be fine.

Also, I’ve been wondering whether the following alternative would work: Instead of setting the relevant column for private fields, set the calculate and trigger columns. This can be used (I think) to transform the data at the moment when the trigger is selected. For example, you could use:

calculate = concat(substr(., 0, 1), '...') to keep the first character of the private value, but strip everything else. Maybe that would be good enough anonymization while still making the form results easier to interpret.
calculate = digest(., 'SHA-256') to hash the value.
You could even use something like if(starts-with(., 'HASH:'), ., concat('HASH:', digest(., 'SHA-256'))). This makes it idempotent, so that the value survives even if the enumerator selects and deselects the trigger field multiple times.

I haven’t tested any of these yet. If someone else does, I’d be happy to hear how it goes.
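
For anyone who wants to check the logic of these expressions outside a form, the sketch below mirrors them in Python. It is not a test of the form functions themselves: the Python version encodes the hash as hexadecimal, which is an assumption and may differ from the encoding that digest() produces in a form, so the two outputs are not expected to match. The function names are made up for illustration.

```python
import hashlib

PREFIX = "HASH:"


def keep_first_char(value):
    """Mirror of concat(substr(., 0, 1), '...'): keep only the first character."""
    return value[:1] + "..."


def hash_once(value):
    """Mirror of the idempotent variant: hash only values not yet hashed."""
    if value.startswith(PREFIX):
        return value  # already hashed, so leave it unchanged
    # Hex encoding is an assumption of this sketch.
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return PREFIX + digest
```

Applying `hash_once` to its own output returns the same string, which is the property that lets the value survive repeated selecting and deselecting of the trigger field.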