Python script to batch upload old data to a new form or account (including images too)


I wanted to share a script I’ve been working on that allows data to be batch uploaded to the kobo server. It can transfer all the data including images from a kobo export to a new form / account, even if the new form has a different layout (headings / groups have different names).

This is a more flexible extrension of what josh described here:

I’m not at all a programmer so its kindof hacky… but it works :slight_smile:

The forum server wont let me upload py or text files so I’ll copy the python code and readme directly below.
You need to create a folder containing the below files, plus a csv with the data you want to upload, and a subfolder containing your photos (if any). See the readme below for more detailed instructions.

Warning: GPS coordinates are not handled automatically by the correspondance template at the moment. when you export data to excel, for some reason kobo removes any groups from the heading of the GPS column. To correctly transfer GPS points, you need to edit the GPS point line in the new_headings column in the correspondence_template.
For example, copying the headings from a data export may give GPS headings like this:


which needs to be manually edited to read:


the old_headings column should not be edited.

Warning 2: untested with repeat questions

Please post any questions and I’ll try to answer them as best I can.


required excel file:
correspondance_template.xlsx (888.0 KB)

readme (save into

# kobo_uploadscript to batch upload kobo data (and images) to another form or server
see below notes for usage

1/ prepare config file
modify to match your API token, form uid, and base url.

2/ prepare submission_data.csv
to prepare the data you want to upload, download the excel data from kobo server by cliking on project -> data -> downloads.
export the data with these settings: 
#   select export type: xls
#   value and header format: xml values and headers
#   Export Select Many questions as… : Single column
#   Include data from all 9 versions : tick
#   Include groups in headers: tick
#   group separator: /
#   Store date and number responses as text : untick
#   Include media URLs : tick
download the exported file and place it in the same folder as the script.
in excel, open the downloaded file and click save as  with filename "submission_data" and file type CSV-UTF8

3/ define how to link the old data to the target form
set up the correspondance_template to define how to link headings in the old data with headings in the target form. 
export data from the target form with "value and header format": "xml values and headers", and copy the first row into correspondance_template, into the column "new_headings" on the "input" tab
If the data you are uploading comes from a different version of the form than the target, copy the headings from the submission_data file and paste them into correspondance_template, into the column "old_headings" on the "input" tab. cut and paste the values until they line up correctly with the new headings.
If the data you are uploading has the same format, copy the xml headings into both the new and old headings columns
go through each line and put an x in the "mark_x_for_media" next to each new heading that will be a media file (tested with photos. milage may vary with other media types).
CORRESPONDANCE_FILE = 'correspondance_template.xlsx'

4/ add any media files
to get all media, download from project -> data -> downloads and export type: media files/zip. 
create a folder next to the script called "photos" and unzip the downloaded zip archive into it.
for very large projects with many images, kobo server may timeout preparing or downloading the zip file. as a workaround:
	- open submission_data in excel and save as type webpage / html. 
	- you can open the created page in firefox and use downthemall or equivilent to download all the image files from their urls (make sure Include media URLs was ticked during data export if you cant see any urls). 
	- the images must be placed in the same file structure as the zipped kobo media export, otherwise the script wont find them. if you play around with the downthemall settings you can download them correctly

5/ run script
in powershell move to the folder with the script and run with :
python kobo_batchuploader.
if you get errors you may need to install the dependancies with pip or similar first.
successful uploads are recorded in successful_submissions.csv. if you rerun the script those lines will be skipped. so you can stop half way and rerun the script without issues.

config file (save as

# see readme for more info

#get token from kobo webpage under account settings -> security -> APIkey
TOKEN = 'paste_token_here'

#get form_UID from the form url eg[form_uid]/landing
FORM_UID = "paste_form_uid_here"

#change if needed, either or
SUBMISSION_URL = f'{BASE_URL}/api/v1/submissions'

#names/ locations of required files
DATA_FILE = 'submission_data.csv'
CORRESPONDANCE_FILE = 'correspondance_template.xlsx'
PHOTO_PATH = "./photos/"

#maybe not working

upload script (save as
#tested on python 3.11.4 / windows 10.

import io
import requests
import uuid
from datetime import datetime
from upload_config import *
from pathlib import Path
import math
from time import sleep
#import lxml.etree as etree
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
import traceback
import re
import os

def format_openrosa_datetime():
    return'T', 'milliseconds')

def create_xml_submission(_uuid, csv_dict, xml_template):
    uid = FORM_UID
    head = f'''
    <{uid} id="{uid}" version="1 (2021-03-25 18:06:48)">

    body = str(xml_template) % csv_dict

    foot = f'''

    xml_payload = (head+body+foot)

    return xml_payload.encode()

def submit_data(i, csv_dict, csv_header, xml_template, image_dict, missing_images, number_skipped):
        _uuid = csv_dict["_uuid"]
        _uuid = str(uuid.uuid4())

    #skip submissions that miss images, if option is set
    if missing_images > 0 and SUBMIT_WITH_MISSING_IMAGES == False:
        number_skipped = number_skipped + 1
        result_text = f'''submission {i} : Images missing. Skipping'''
        return result_text, number_skipped
    # create successful submission log file if it doesnt exist
        file = open('successful_submissions.txt', 'r')
    except IOError:
        file = open('successful_submissions.txt', 'w')
    # skip if the submission was already made successfully
    with open('successful_submissions.txt', 'r') as f:
        if _uuid in
            result_text = f'''submission {i} : Already done. Skipping'''
            return result_text, number_skipped

    file_tuple = (_uuid, io.BytesIO(create_xml_submission(_uuid, csv_dict, xml_template)))
    files = {'xml_submission_file': file_tuple}

    for image in image_dict:
        file_name = image_dict[image]
        another_file_tuple = (file_name, open(file_name, 'rb'))
        files = {**files, file_name: another_file_tuple}
    res = ""
    headers = {'Authorization': f'Token {TOKEN}'}
    req = requests.Request(method='POST', url=SUBMISSION_URL, files=files, headers=headers ).prepare()
    num_retry = 0

    while num_retry < MAX_RETRIES:
            session = requests.Session()
            retry = Retry(connect=3, backoff_factor=0.5)
            adapter = HTTPAdapter(max_retries=retry)
            session.mount('http://', adapter)
            session.mount('https://', adapter)
            res = session.send(req)
            if res.status_code == 201:
                with open("successful_submissions.txt", "a") as successfile:
                    successfile.write(_uuid + "\n")
                result_text = f'''submission {i} : {res.status_code} : Success'''
                return result_text, number_skipped
            if res.status_code == 202:
                with open("successful_submissions.txt", "a") as successfile:
                    successfile.write(_uuid + "\n")
                result_text = f'''submission {i} : {res.status_code} : Duplicate submission. Continuing.'''
                return result_text, number_skipped
            print (f'''
submission {i} : {res.status_code} : Server returned error. Response:
            num_retry = num_retry + 1
        except requests.Timeout as error:
            num_retry = num_retry + 1
            print(f'''submission {i} : Timeout: {error}. Retrying...''')
        except requests.ConnectionError as error:
            num_retry = num_retry + 1
            print (f'''submission {i} : Connection error: {error}. Retrying...''')            
    number_skipped = number_skipped + 1
    result_text = f'''submission {i} : Skipping.''', number_skipped
    return result_text, number_skipped
def get_image_path(image_name, uuid, urls_dict):
    paths = []
    startpath = ".\\"
    for path in Path(f"{PHOTO_PATH}").glob(f"*/attachments/*/{uuid}/{image_name}"):
        path = startpath+str(path)
    if len(paths) > 1:
        print(f'''WARNING: multiple paths found for {image_name}: 
        Using the first path found.''')
    if len(paths) == 0:
        print(f'''WARNING: image {image_name} not found at {PHOTO_PATH}/*/attachments/*/{uuid}/{image_name}. Check  that images are unzipped in the correct location. PHOTO_PATH can be changed in "". Verify the media is marked correctly in correspondance template. Trying to download...''')
        trimlength = len(BASE_URL) + 27
        alt_path = str(Path(f"{PHOTO_PATH}"+urls_dict[image_name].replace("%2F","/")[trimlength:]))
        slash = [m.start() for m in re.finditer(r"\\",alt_path)]
        alt_path = startpath + alt_path[:slash[3]+1]+ uuid + alt_path[slash[4]:]
        url = urls_dict[image_name]
            image_data = requests.get(url)
            os.makedirs(os.path.dirname(alt_path), exist_ok=True)
            with open(alt_path, 'wb') as f:
            print(f'''{image_name} saved to {alt_path}''')

            print(f'''failed, adding to missing images file''')
            # write missing image and url to missing images file  
                file = open('missing_images.txt', 'r')
            except IOError:
                file = open('missing_images.txt', 'w')
            with open("missing_images.txt", "r") as f:
                if image not in
                    open("missing_images.txt", "a").write(image + ", " + url + ", " + image_path + "\n")
    return paths[0]

def get_xml_data ():
    import pandas as pd
    xml_df = pd.DataFrame(pd.read_excel(CORRESPONDANCE_FILE, sheet_name="result", header=0))
    xml_template_pd             = xml_df.loc[: , "generated_xml"]
    xml_template = '\n'.join([str(line) for line in xml_template_pd])
    media_now                   = xml_df.loc[: , "media_now"]
    headings_to_use    = xml_df.loc[: , "oldQs_full"]
    media_dict = dict(zip(headings_to_use, media_now))
    xml_data = [xml_template , headings_to_use , media_dict, media_now]
    return xml_data

def get_submission_data(headings_to_use):
    import pandas as pd
    submissiondata_df = pd.DataFrame(pd.read_csv(DATA_FILE, encoding='utf-8', usecols = headings_to_use)).fillna('')
    return submissiondata_df

def prepare_submission(i, xml_data, submissiondata_df):
    xml_template = xml_data[0] 
    headings_to_use = xml_data[1] 
    media_dict = xml_data[2] 
    media_now = xml_data[3] 
    csv_dict = {}
    image_path = None
    image_names = []
    image_dict = {}
    urls_dict = {}
    this_row = submissiondata_df.iloc[i][headings_to_use]
    #read vars:
    for j in range(0, len(headings_to_use)):
            this_data = this_row[j].replace("<","&#60;").replace("&","&#38;").replace(">","&#62;").replace("'","&#39;").replace('"',"&#34;")
            this_data = this_row[j]
        csv_dict = {**csv_dict, headings_to_use[j]: this_data} 
        #collect image names
        if media_now[j]:
            #spaces in image names are replaced by kobo server with underscore, but remain as spaces in the database
            image_name = this_row[j].replace(" ","_")
            image_url = this_row[j+1]
            notnan = not(image_name == '' or (isinstance(image_name, float) and  math.isnan(image_name)))
            if notnan:
                urls_dict = {**urls_dict, image_name: image_url} 
    # add any images to the image dictionary
    missing_images = 0
    for image in image_names:
            this_uuid = csv_dict["_uuid"]
                image_path = get_image_path(image, this_uuid, urls_dict)
                image_dict = {**image_dict, image : image_path}    
            except Exception:
                print("missing image error")
                missing_images = missing_images + 1
        except Exception:
            print(f"uuid for {image} not found" )
    return csv_dict, image_dict, missing_images

if __name__ == '__main__':
    xml_data = get_xml_data ()
    xml_template = xml_data[0]
    headings_to_use = xml_data[1]
    submissiondata_df = get_submission_data(headings_to_use)
    number_skipped = 0
    for i in range(0, len(submissiondata_df)):
        csv_dict, image_dict, missing_images = prepare_submission(i, xml_data, submissiondata_df)
        res, number_skipped = submit_data(i, csv_dict, headings_to_use, xml_template, image_dict, missing_images, number_skipped)
    print("Done. Number skipped: " + str(number_skipped))

This is interesting. Please,can you share the video description of this?

1 Like

@ralphthompson, thank you for sharing it with the entire community. Maybe you could share the py and txt files through a google doc so those interested could download them if needed. Once again, :clap: :heart: :partying_face:

Expecting similar support in the upcoming days too …

@ralphthompson Thank you so much for this. I have been trying to make this work for months. I have two comment for you:

  1. Regarding skipping submissions, the number_skipped value is not increased both when the submission is duplicate and when it is “Already done” I edited the code by incrementing number_skipped in both instances and it worked perfectly.
  2. I only tested this with a small form that has only two questions, does your code work with repeat_group? Is there any additions that need to be considered for repeats?

For people trying this method, the key is to make the correspondance_template.xlsx ,matches your form. In my case, the old_headings matched the new_headings so I just copied the same column names in both headings and removed the “x” because I did not have a media type question.

Hi Isslam,

for 1, in my case the counter of number_skipped was to know if I needed to rerun the script, so duplicate or already done submissions were not counted. feel free to adapt to your needs :wink:
2 i have not tested it with repeat type questions, but if i remember correctly the data from repeats gets stored in multiple sheets, so i guess it will break the script which expects a single csv as the submission data. if anyone can get it to work with repeating questions please share !