Python script to batch upload old data to a new form or account (including images too)

ralphthompson · July 30, 2023, 10:26am

Hi,

I wanted to share a script I’ve been working on that allows data to be batch uploaded to the kobo server. It can transfer all the data including images from a kobo export to a new form / account, even if the new form has a different layout (headings / groups have different names).

This is a more flexible extrension of what josh described here:

I’m not at all a programmer so its kindof hacky… but it works

The forum server wont let me upload py or text files so I’ll copy the python code and readme directly below.
You need to create a folder containing the below files, plus a csv with the data you want to upload, and a subfolder containing your photos (if any). See the readme below for more detailed instructions.

Warning: GPS coordinates are not handled automatically by the correspondance template at the moment. when you export data to excel, for some reason kobo removes any groups from the heading of the GPS column. To correctly transfer GPS points, you need to edit the GPS point line in the new_headings column in the correspondence_template.
For example, copying the headings from a data export may give GPS headings like this:

GPS_Location	
group_fp6yl02/_GPS_Location_latitude	
group_fp6yl02/_GPS_Location_longitude	
group_fp6yl02/_GPS_Location_altitude	
group_fp6yl02/_GPS_Location_precision

which needs to be manually edited to read:

group_fp6yl02/GPS_Location	
group_fp6yl02/_GPS_Location_latitude	
group_fp6yl02/_GPS_Location_longitude	
group_fp6yl02/_GPS_Location_altitude
group_fp6yl02/_GPS_Location_precision

the old_headings column should not be edited.

Warning 2: untested with repeat questions

Please post any questions and I’ll try to answer them as best I can.

Thanks
Ralph

required excel file:
correspondance_template.xlsx (888.0 KB)

readme (save into readme.md)

# readme.md
# kobo_uploadscript to batch upload kobo data (and images) to another form or server
see below notes for usage

1/ prepare config file
----------------------
modify upload_config.py to match your API token, form uid, and base url.


2/ prepare submission_data.csv
------------------------------
to prepare the data you want to upload, download the excel data from kobo server by cliking on project -> data -> downloads.
export the data with these settings: 
#   select export type: xls
#   value and header format: xml values and headers
#   Export Select Many questions as… : Single column
#   Include data from all 9 versions : tick
#   Include groups in headers: tick
#   group separator: /
#   Store date and number responses as text : untick
#   Include media URLs : tick
download the exported file and place it in the same folder as the script.
in excel, open the downloaded file and click save as  with filename "submission_data" and file type CSV-UTF8

3/ define how to link the old data to the target form
-----------------------------------------------------
set up the correspondance_template to define how to link headings in the old data with headings in the target form. 
export data from the target form with "value and header format": "xml values and headers", and copy the first row into correspondance_template, into the column "new_headings" on the "input" tab
If the data you are uploading comes from a different version of the form than the target, copy the headings from the submission_data file and paste them into correspondance_template, into the column "old_headings" on the "input" tab. cut and paste the values until they line up correctly with the new headings.
If the data you are uploading has the same format, copy the xml headings into both the new and old headings columns
go through each line and put an x in the "mark_x_for_media" next to each new heading that will be a media file (tested with photos. milage may vary with other media types).
CORRESPONDANCE_FILE = 'correspondance_template.xlsx'

4/ add any media files
----------------------
to get all media, download from project -> data -> downloads and export type: media files/zip. 
create a folder next to the script called "photos" and unzip the downloaded zip archive into it.
for very large projects with many images, kobo server may timeout preparing or downloading the zip file. as a workaround:
	- open submission_data in excel and save as type webpage / html. 
	- you can open the created page in firefox and use downthemall or equivilent to download all the image files from their urls (make sure Include media URLs was ticked during data export if you cant see any urls). 
	- the images must be placed in the same file structure as the zipped kobo media export, otherwise the script wont find them. if you play around with the downthemall settings you can download them correctly

5/ run script
-------------
in powershell move to the folder with the script and run with :
python kobo_batchuploader.
if you get errors you may need to install the dependancies with pip or similar first.
successful uploads are recorded in successful_submissions.csv. if you rerun the script those lines will be skipped. so you can stop half way and rerun the script without issues.

config file (save as upload_config.py)

# upload_config.py
# see readme for more info

#get token from kobo webpage under account settings -> security -> APIkey
TOKEN = 'paste_token_here'

#get form_UID from the form url eg https://kobo.humanitarianresponse.info/#/forms/[form_uid]/landing
FORM_UID = "paste_form_uid_here"

#change if needed, either https://kc.kobotoolbox.org or https://kc.humanitarianresponse.info
BASE_URL = 'https://kc.humanitarianresponse.info'
SUBMISSION_URL = f'{BASE_URL}/api/v1/submissions'

#names/ locations of required files
DATA_FILE = 'submission_data.csv'
CORRESPONDANCE_FILE = 'correspondance_template.xlsx'
PHOTO_PATH = "./photos/"


#maybe not working
SUBMIT_WITH_MISSING_IMAGES = False
MAX_RETRIES = 5

upload script (save as kobo_batchuploader.py)

#kobo_batchuploader.py
#tested on python 3.11.4 / windows 10.

import io
import requests
import uuid
from datetime import datetime
from upload_config import *
from pathlib import Path
import math
from time import sleep
#import lxml.etree as etree
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
import traceback
import re
import os

def format_openrosa_datetime():
    return datetime.now().isoformat('T', 'milliseconds')

def create_xml_submission(_uuid, csv_dict, xml_template):
    uid = FORM_UID
    head = f'''
    <{uid} id="{uid}" version="1 (2021-03-25 18:06:48)">
        <formhub>
            <uuid>d0f06b3a0cce4a4f9c69de7fb6c10f20</uuid>
        </formhub> 
    '''
    

    body = str(xml_template) % csv_dict

    foot = f'''
        <__version__>vWG8dR5D2BhooJjhQ8zrqy</__version__>
        <meta>
            <instanceID>uuid:{_uuid}</instanceID>
        </meta>
    </{uid}>
    '''
    

    
    xml_payload = (head+body+foot)

    return xml_payload.encode()

def submit_data(i, csv_dict, csv_header, xml_template, image_dict, missing_images, number_skipped):
    try:
        _uuid = csv_dict["_uuid"]
    except:
        _uuid = str(uuid.uuid4())

    #skip submissions that miss images, if option is set
    if missing_images > 0 and SUBMIT_WITH_MISSING_IMAGES == False:
        number_skipped = number_skipped + 1
        result_text = f'''submission {i} : Images missing. Skipping'''
        return result_text, number_skipped
    # create successful submission log file if it doesnt exist
    try:
        file = open('successful_submissions.txt', 'r')
        file.close()
    except IOError:
        file = open('successful_submissions.txt', 'w')
        file.close()
        
    # skip if the submission was already made successfully
    with open('successful_submissions.txt', 'r') as f:
        if _uuid in f.read():
            result_text = f'''submission {i} : Already done. Skipping'''
            return result_text, number_skipped
            



        
            
    file_tuple = (_uuid, io.BytesIO(create_xml_submission(_uuid, csv_dict, xml_template)))
    files = {'xml_submission_file': file_tuple}

    for image in image_dict:
        file_name = image_dict[image]
        another_file_tuple = (file_name, open(file_name, 'rb'))
        files = {**files, file_name: another_file_tuple}
    res = ""
    headers = {'Authorization': f'Token {TOKEN}'}
    req = requests.Request(method='POST', url=SUBMISSION_URL, files=files, headers=headers ).prepare()
    
    
    num_retry = 0

    while num_retry < MAX_RETRIES:
        try:
            session = requests.Session()
            retry = Retry(connect=3, backoff_factor=0.5)
            adapter = HTTPAdapter(max_retries=retry)
            session.mount('http://', adapter)
            session.mount('https://', adapter)
            res = session.send(req)
            
            if res.status_code == 201:
                with open("successful_submissions.txt", "a") as successfile:
                    successfile.write(_uuid + "\n")
                result_text = f'''submission {i} : {res.status_code} : Success'''
                return result_text, number_skipped
            if res.status_code == 202:
                with open("successful_submissions.txt", "a") as successfile:
                    successfile.write(_uuid + "\n")
                result_text = f'''submission {i} : {res.status_code} : Duplicate submission. Continuing.'''
                return result_text, number_skipped
            print (f'''
submission {i} : {res.status_code} : Server returned error. Response:
    {res.text}
Retrying...
    ''')
            num_retry = num_retry + 1
        except requests.Timeout as error:
            num_retry = num_retry + 1
            print(f'''submission {i} : Timeout: {error}. Retrying...''')
        except requests.ConnectionError as error:
            num_retry = num_retry + 1
            print (f'''submission {i} : Connection error: {error}. Retrying...''')            
            sleep(5)
    number_skipped = number_skipped + 1
    result_text = f'''submission {i} : Skipping.''', number_skipped
    return result_text, number_skipped
    
    
def get_image_path(image_name, uuid, urls_dict):
    paths = []
    startpath = ".\\"
    for path in Path(f"{PHOTO_PATH}").glob(f"*/attachments/*/{uuid}/{image_name}"):
        path = startpath+str(path)
        paths.append(path)
        
  
    if len(paths) > 1:
        print(f'''WARNING: multiple paths found for {image_name}: 
        {paths}
        Using the first path found.''')
    if len(paths) == 0:
        print(f'''WARNING: image {image_name} not found at {PHOTO_PATH}/*/attachments/*/{uuid}/{image_name}. Check  that images are unzipped in the correct location. PHOTO_PATH can be changed in "upload_config.py". Verify the media is marked correctly in correspondance template. Trying to download...''')
        
        trimlength = len(BASE_URL) + 27
        alt_path = str(Path(f"{PHOTO_PATH}"+urls_dict[image_name].replace("%2F","/")[trimlength:]))
        slash = [m.start() for m in re.finditer(r"\\",alt_path)]
        alt_path = startpath + alt_path[:slash[3]+1]+ uuid + alt_path[slash[4]:]
        url = urls_dict[image_name]
        
        try:
            image_data = requests.get(url)
            os.makedirs(os.path.dirname(alt_path), exist_ok=True)
            with open(alt_path, 'wb') as f:
                f.write(image_data.content)
            print(f'''{image_name} saved to {alt_path}''')

            paths.append(alt_path)
        except:
            traceback.print_exc()
            print(f'''failed, adding to missing images file''')
            # write missing image and url to missing images file  
            try:
                file = open('missing_images.txt', 'r')
                file.close()
            except IOError:
                file = open('missing_images.txt', 'w')
                file.close()
            with open("missing_images.txt", "r") as f:
                if image not in f.read():
                    open("missing_images.txt", "a").write(image + ", " + url + ", " + image_path + "\n")
        
    
    return paths[0]



def get_xml_data ():
    import pandas as pd
    xml_df = pd.DataFrame(pd.read_excel(CORRESPONDANCE_FILE, sheet_name="result", header=0))
    xml_template_pd             = xml_df.loc[: , "generated_xml"]
    xml_template = '\n'.join([str(line) for line in xml_template_pd])
    #print(xml_template_pd)
    
    media_now                   = xml_df.loc[: , "media_now"]
    headings_to_use    = xml_df.loc[: , "oldQs_full"]
    media_dict = dict(zip(headings_to_use, media_now))
    xml_data = [xml_template , headings_to_use , media_dict, media_now]
    return xml_data



def get_submission_data(headings_to_use):
    import pandas as pd
    submissiondata_df = pd.DataFrame(pd.read_csv(DATA_FILE, encoding='utf-8', usecols = headings_to_use)).fillna('')
    print(submissiondata_df)
    return submissiondata_df



def prepare_submission(i, xml_data, submissiondata_df):
    xml_template = xml_data[0] 
    headings_to_use = xml_data[1] 
    media_dict = xml_data[2] 
    media_now = xml_data[3] 
    csv_dict = {}
    image_path = None
    image_names = []
    image_dict = {}
    urls_dict = {}
    
    this_row = submissiondata_df.iloc[i][headings_to_use]
    
    #read vars:
    for j in range(0, len(headings_to_use)):
        try:
            this_data = this_row[j].replace("<","&#60;").replace("&","&#38;").replace(">","&#62;").replace("'","&#39;").replace('"',"&#34;")
        except:
            this_data = this_row[j]
        csv_dict = {**csv_dict, headings_to_use[j]: this_data} 
        
        #collect image names
        if media_now[j]:
            #spaces in image names are replaced by kobo server with underscore, but remain as spaces in the database
            image_name = this_row[j].replace(" ","_")
            image_url = this_row[j+1]
            notnan = not(image_name == '' or (isinstance(image_name, float) and  math.isnan(image_name)))
            if notnan:
                image_names.append(image_name)
                urls_dict = {**urls_dict, image_name: image_url} 
    # add any images to the image dictionary
    missing_images = 0
    for image in image_names:
        try: 
            this_uuid = csv_dict["_uuid"]
            try:
                image_path = get_image_path(image, this_uuid, urls_dict)
                image_dict = {**image_dict, image : image_path}    
            except Exception:
                print("missing image error")
                traceback.print_exc()
                missing_images = missing_images + 1
        except Exception:
            print(f"uuid for {image} not found" )
            traceback.print_exc()
           
    return csv_dict, image_dict, missing_images



if __name__ == '__main__':
    print('''
    ''')
    xml_data = get_xml_data ()
    xml_template = xml_data[0]
    headings_to_use = xml_data[1]
    
    submissiondata_df = get_submission_data(headings_to_use)
    number_skipped = 0
    
    for i in range(0, len(submissiondata_df)):
        csv_dict, image_dict, missing_images = prepare_submission(i, xml_data, submissiondata_df)
        
        res, number_skipped = submit_data(i, csv_dict, headings_to_use, xml_template, image_dict, missing_images, number_skipped)
        print(res)
    print("Done. Number skipped: " + str(number_skipped))

chiabotu · July 30, 2023, 12:23pm

This is interesting. Please,can you share the video description of this?

Kal_Lam · July 30, 2023, 2:27pm

@ralphthompson, thank you for sharing it with the entire community. Maybe you could share the py and txt files through a google doc so those interested could download them if needed. Once again,

Expecting similar support in the upcoming days too …

Isslam · July 30, 2023, 8:57pm

@ralphthompson Thank you so much for this. I have been trying to make this work for months. I have two comment for you:

Regarding skipping submissions, the number_skipped value is not increased both when the submission is duplicate and when it is “Already done” I edited the code by incrementing number_skipped in both instances and it worked perfectly.
I only tested this with a small form that has only two questions, does your code work with repeat_group? Is there any additions that need to be considered for repeats?

For people trying this method, the key is to make the correspondance_template.xlsx ,matches your form. In my case, the old_headings matched the new_headings so I just copied the same column names in both headings and removed the “x” because I did not have a media type question.

ralphthompson · July 31, 2023, 3:24am

Hi Isslam,

for 1, in my case the counter of number_skipped was to know if I needed to rerun the script, so duplicate or already done submissions were not counted. feel free to adapt to your needs
2 i have not tested it with repeat type questions, but if i remember correctly the data from repeats gets stored in multiple sheets, so i guess it will break the script which expects a single csv as the submission data. if anyone can get it to work with repeating questions please share !

Akratos · January 28, 2024, 8:09pm

Hi @ralphthompson, thanks for sharing the code.

In my case I am not using images or geocoordinates, so I commented out the lines of code corresponding to what has to do with loading images and made other arrangements, but I found that the empty data dictionary is sent. I’ve been checking and the file correspondence_template.xlsx seems to be wrong because it has a link to a file called input.xlsx which causes a #REF in the _calc sheet. I have broken the link, but it continues with the same error and the Excel sheet remains empty without generating any type of result in any of the three sheets when the column headers (new_headings and old_headings) are assigned.

Please could you review the excel workbook and share it again?

Once I have the book corrected, I will load the data again, I will let you know how the code is and I will share it with the changes I made.

Greetings.

rehanmaqbul · February 6, 2024, 12:50pm

Hi @ralphthompson and @Kal_Lam I don’t know if this still works, Before I proceed, I want to confirm if it’s possible for me to edit the data stored on the server. I specifically need to change the values within the XML files, not the labels.

Additionally, I encountered an issue when I modified the XML values in a sample form and redeployed it. Despite making changes, the old XML values persisted when I downloaded the data. Furthermore, the skip function didn’t work properly in the updated form.

Given that we’ve collected over 2500 pieces of complex data from heart patients, I’m wondering if I should update the dataset on the server (assuming the XML values can be modified) or if there’s another solution to edit the XML values in the existing dataset. I’d appreciate any guidance on this matter.

rehanmaqbul · February 6, 2024, 6:53pm

@ralphthompson does it works with new eu.kobotoolbox too?

ralphthompson · February 7, 2024, 12:59am

Hi @rehanmaqbul,

I haven’t tested it with the new servers so your mileage may vary. About 6 months ago I left the job where I was working with kobo regularly so I haven’t really kept up to date since then.
When you say you want to change the values in the xml files, you mean you want to manually edit the database? If so,if it’s just a few records you can do it manually inside the kobo server. Otherwise you can download the data and modify it in excel, then use the script to reupload it ( I would do it to a clone of your survey, and keep the original data as a backup incase it goes wrong).

@Akratos - I’ll try take a look at the broken excel link in the next few days. If you can share / pm your version of correspondance template and any output files / text you got, it will help me look for errors.

ralphthompson · February 7, 2024, 4:57pm

@Akratos hi, i downloaded the correspondence template from the link above to a clean folder to test but I cant replicate your error. I can see links in the sheet to the ‘input’ tab, but nothing to another file input.xlsx, and for me the sample is giving the correct xml as output. maybe try redownloading the original and making any edits again carefully, checking if anything is breaking the sheet as you go. it sounds like maybe when you removed lines of code, the script is not receiving what it expects . if you don’t have any x’s in the media column of the excel, the script lines related to media won’t activate anyway, so maybe try testing without modifying the code first, once its working then start to edit to fit your needs

Akratos · February 7, 2024, 5:33pm

Continuing the discussion from Python script to batch upload old data to a new form or account (including images too):

Thank you for your response @ralphthompson As soon as I open the Excel workbook (original), an alert message appears that says:
“This book contains links to one or more external sources that may not be secure”
When I click the update option on that alert, it throws me another alert: “We can’t update some of the links in your book right now.” I click on the “Edit Links…” option and see the error “source not found”.
I eliminate the “x” because I don’t have images. I copy the names of my fields into new_headings and old_headings (both columns the same) but absolutely no new values appear in the yellow-shaded boxes and I assume that values should appear there according to the formulas contained in those cells. Initially, also do not value values in those cells with the fields you have registered.
That is the problem I experience.

yunamed · March 1, 2024, 6:42pm

Hi, @ralphthompson

Please this is a follow up on @rehanmaqbul 's ask on the new eu.kobotoolbox.org servers. Please does the script still work for those servers?

ralphthompson · March 2, 2024, 6:20pm

Hi, I think it should still work, but you will need to change the value for BASE_URL in the file upload_config.py

surveyorjr · March 3, 2024, 7:46am

Hi @ralphthompson thanks very much. . Just a quick one. Do you have a GitHub Repo for this source code that we can also fork/clone and also contribute towards for any additions and customizations and use cases for the community?

This is great. Thanks very much.

ralphthompson · March 4, 2024, 6:36pm

Hi @surveyorjr , good idea, I just added it here: GitHub - ralph-thompson/kobo_uploader: script to batch upload data and media files to kobo servers

surveyorjr · March 5, 2024, 1:08pm

This is awesome @ralphthompson , thanks

chiabotu · March 6, 2024, 6:08am

This is nice.

Mat23 · May 6, 2024, 5:44pm

Hi @ralphthompson

Thanks so much for that script, exactly what I have been wanting to do!

I am having some issues setting it up for kf_kobotoolbox_org, and would really appreciate if any one would be able to help me.

I am fairly new to working with APIs, but it seems to me that the API end points have changed quite a bit with the new servers. I was not able to find an equivalent for “kc_kobotoolbox_org/api/v1/submissions” for “kf_kobotoolbox_org”

I went through “kf_kobotoolbox.org/api/v2/” but could not ID what would be the equivalent end point. ChatGPT kindly suggested:
“KoboToolbox Form Building API<asset_uid>/submissions/”
But I do not think it actually exist (got 404 errors when trying that)

I am able to run the script when using the previous kc_kobotoolbox_org API endpoint, and get 201/202 answers from the servers (though I don’t know where the data actually gets sent, since I would imagine there is nothing matching my form ID over there…).

If anyone has been able to figure it out, I would greatly appreciate it

(I am having a hard time with the “2 links per post from newbies” rule, had to edit a few things)

Mat23 · June 11, 2024, 12:39am

Alright, this post here completely solved the issue: