Regex Problem in KoBoCollect

wroos · January 2, 2020, 7:15pm

Description

A regex constraint to check a text field for at least one occurrence of a word with 3 letters works differently in KoBoCollect, and we think incorrectly.
The regex result seems ok in Preview/Enketo, but wrong in KoBoCollect.

Steps to Reproduce

See attached XLSForm
BugRegexCollect01.xlsx (19.0 KB)

Import & deploy the form
Enter a first word with at least 3 letters
Enter further words with at least 3 letters (i.e. more than one word …)
Do this in KoBoCollect on an Android smartphone (wrong).
And in Preview/Enketo (ok)…

Expected behavior

If at least one word of 3 or more letters is entered, the regex constraint should pass, independent of further entries
Any regex constraint should work equally on KoBoCollect & Enketo
Any regex constraint in KoBo should comply with the standard. See e.g.

Actual behavior

KoBoCollect results in invalid constraint after entering anything more than the first word (even no blanks before or after!)

It works like expected in Preview/Enketo.

It is treated incorrectly, as if the regex were:
regex( ., ‘(^[a-zA-Z]{3,})$’) - only 1 word, no spaces etc.

Additional details

OCHA Server, XLSForm (Windows 10). Android 5.0, Samsung Galaxy Note3 (SM-N900)
Two further support questions, please:

How can regex behavior in KoBoCollect depend on device system version?
Is there any documentation (link?)
In general, which KoBoCollect features are sensitive to, or do not work with, specific Android versions? (Like in “Mobile application is not working”.)
Thanks.

Xiphware · January 3, 2020, 3:35pm

curious, does this regex behave better in KoboCollect?

regex(., '[a-zA-Z]{3}')

(it might help narrow down where the regex processor is misbehaving…)

FYI, KoboCollect (aka ODK Collect) uses the standard Java library class java.util.regex.Pattern to perform regex matches, whereas Enketo uses the entirely different JavaScript RegExp() implementation. They should behave the same, but they are completely different codebases, so YMMV…

wroos · January 3, 2020, 8:15pm

Dear @Xiphware,
thanks, nice idea. But same result (of course now no more than 5 letters in this ONE word.)

In Collect: It seems as if the pattern search starts at entry position ONE and applies the pattern ONCE. If not matched, it stops at once!
In Enketo/Preview it’s all like expected (and with regex test tools).

I made some more examples. Have a look esp. on the last two. I think the { } are NOT the cause.
BugRegexCollect02.xlsx (20.9 KB)
I would be surprised, if the Java developer communities did not find the problem, if it’s in standard Java.
(Someone might test the examples with a java program?)

Here are two screenshots of the known Collect problem, from the attached xls.

Again, it works well (like expected) in Enketo/Preview.

Can the problem depend on an Android version?? To check, someone might test with a newer version?

We still hope we do not need to remove all these constraints in our KoBoCollect app.

Kind regards
and Happy, healthy and peaceful New Year for you
Wolfgang

janna · January 3, 2020, 11:38pm

Hi @wroos,

Hope this might help, try this regex:

regex(., '([a-zA-Z ]*[a-zA-Z]{3,}[a-zA-Z ]*)')

This is the explanation:

Allow a space: so, instead of [a-zA-Z], I believe you also want to allow someone to add in a space. So use this instead: [a-zA-Z ]

First, you need to say [a-zA-Z ]* with a star at the end, to say they can enter this zero or more times.

Second, you then want to put in [a-zA-Z]{3,} (this has no space added) to say they need to have a string of at least three letters in there at least once.

Third, you then want to repeat [a-zA-Z ]* with a space and a star at the end, to say the entry can contain more letters or spaces zero or more times.

I’ve tried this out, deployed it, and it works as I believe you want it to using ODK Collect.

Cheers!
Janna

wroos · January 4, 2020, 11:38am

Dear @janna,
thanks for joining, esp. early in the year. Your workaround works for us partly.

Our requirement is just to control a minimum, that the user enters at least one word with at least 3 letters. Where ever in the text and with what ever other elements in this text.
We use this as a minimum for Other (specify) fields, just to avoid things like “…” or “?” (Only blanks as bad entry is covered by KoBo, when we set the field to required.)
So, we want to flexibly allow text like this: “a dog” OR " I saw 10 dogs, after getting-up, around 10:30… But camels = 0." But NOT a text like: " a bc d" or “?” or “101”.

With your regex formula, Enketo allows ‘10 dogs’ (or “start-up”). But KoBoCollect doesn’t. (Of course, we could add 0-9- and all other elements we want to allow and can be entered with a smartphone.).
I also tried: regex(., ‘([a-zA-Z]{3,})+’). It works well in Enketo/Preview, allowing things like " a dog" or “no dogs and cats” but NOT in KoBoCollect.

It’s like KoBoCollect is only applying a fixed pattern from position 1 on. I think, the behavior is NOT conform to the standard regex specification, which is working well in Enketo. Tell, me if I am wrong with that, please.

Beside a workaround, let us try to understand better how KoBoCollect generally treats regex (different to Enketo & the normal regex behavior), to be able to adapt regex expressions to work well. Also, to avoid that users are testing successfully with Preview, but the regex will not do the same after deployment in Collect.

THANKS again and have a Happy, peaceful and healthy New Year
Wolfgang

janna · January 4, 2020, 1:03pm

Hi @wroos,
Yes, the regex you’re using right now will only allow letters, not numeric digits and special characters.

If you want to allow all those, then we need to expand your expression to allow them.

So, you want to allow digits, then add “0-9” into your regex.

What special characters do you want to allow the user to enter? If you tell us, then we can help you create an expression that allows all the special characters you want.

Speak soon,
Janna

wroos · January 4, 2020, 6:00pm

Dear @janna,
thanks for your offer.
I would primarily like to understand, how KoBoCollect generally treats any regex (and why different to Enketo and the normal regex behavior). This will allow us to adapt regex expressions to the Collect style.
I also took a look at three more articles:

Unfortunately, none of the given examples corresponds to ours. But the reference tool from Tino Kreutzer (@Tinok) will MATCH (unless the “STICKY” flag is set), like Enketo does. Here is the screenshot

(with global flag it will match twice).

I think, the difference between Collect and Enketo is NOT caused by the different Java library (see @Kal_Lam above). But someone of you might check this (outside Collect).

Still eager to understand the Collect regex treatment, and if this might be a bug.
Kind regards
Wolfgang

wroos · January 5, 2020, 2:14pm

Dear @janna, dear @Xiphware, dear @tinok,
Let me add a very simple example which works differently on KoBoCollect
regex(., ‘[a]+’)
Attachment with some more variants
ARegexCollect01.xlsx (12.8 KB)
RegEx tool

Screenshot Collect (after entered Ba)

Again Preview/Enketo does what is expected, but KoBoCollect allows only a (1-n times), nothing else. Collect treats the regex like regex(., ‘^[a]+$’).
(I didn’t find any documentation or explanation for this behavior.)
Waiting for your expert feedback, please.
Best regards

Xiphware · January 6, 2020, 2:53am

Thanks for digging into this. I’ll try to confirm whether the latest actual ODK Collect source also exhibits this behavior, and report back. As you say, the issue is less whether some other suitable regex expression could accomplish what you desire, and more that there appears to be markedly different behavior depending on which client you are running… That’s certainly undesirable.

wroos · January 6, 2020, 10:34am

Dear @Xiphware,
Thanks!
As I couldn’t find any hints in the two fora, it might still be the case.
Could you test it with other/newer Android versions? Are you sure it can’t be device-related? It would also be interesting to check, if really the used standard lib behaves like Collect.
Perhaps, Tino may also add a hint in his regex support article?
Have a nice day
Kind regards

janna · January 8, 2020, 10:17am

Hi @wroos,

This looks to be what is happening, comparing Collect and Enketo:

What does it look like Enketo is doing with the regex: It looks like it’s looking for any part of the text entry that matches your regex. This is also what the snapshot you posted above looks like it’s doing. The entire entry you show in your snapshot does not match the regex (s de 10 sdf d fgf). It says it has found 1 match in that string. And that’s what it looks like Enketo is doing - it’s allowing any text entry in your form, however it’s looking for at least one match to your regex, and that will then validate. This makes sense to me, because you’re not using a ^ and a $ at the beginning and end of the regex.

What does Collect do with the regex: When you put a regex into Collect - it looks like it’s checking the entire text entry in the form…and that has to match the regex that you’ve put as a constraint. Which is happening even though you haven’t used a ^ and a $ at the beginning and end of the regex.

Does anyone else know why this would be happening? @Xiphware?
thanks!
Janna

wroos · January 8, 2020, 11:27am

Dear @janna,
thanks for putting it together. I have the same understanding, and think Enketo corresponds to the normal regex behaviour, (at least outside life with Collect).

I have the following ideas to go on:

Can we make sure it is NOT a matter of Android version (device)
Can we make sure it’s NOT the different Java lib? (I.e. the lib does like Enketo.)
Can someone look in the code of Collect? (E.g. if wrong wrapping of regex lib)
After this discuss if this is a bug or a “feature” in Collect.
And document it in support article.

Waiting for the news
Kind regards

Xiphware · January 8, 2020, 3:22pm

I believe this may be a bug in javarosa (which is used by Collect/KoboCollect to evaluate XPath expressions). Am confirming with ODK devs and will report back…

Xiphware · January 9, 2020, 2:23am

FYI, presently javaRosa performs a

java.util.regex.Pattern.matches(re, str)

which, as defined, does the following [emphasis added]

public boolean matches()
Attempts to match the entire region against the pattern.

which would account for the observed behavior: the entire string is matched; i.e. "^...$" anchors are implicitly added around your regex. I suspect we should probably instead be using:

public boolean find()
Attempts to find the next subsequence of the input sequence that matches the pattern.

which would, correctly, look for a matching substring. Will confirm and if so the fix should be forthcoming [although it would still take some time for Kobo to pull latest ODK Collect/javaRosa fix].

From what I can tell, it’s probably safe to assume that KoboCollect currently (and incorrectly?) will only perform a full match of your regex against the target string; i.e. implicitly adding "^...$". So you should probably construct your regex accordingly until this is resolved.
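To make the difference concrete, here is a minimal Java sketch of the two matching modes described above. It only uses the standard java.util.regex API; the class and method names (MatchesVsFind, fullMatch, partialMatch) are made up for illustration and are not taken from the javaRosa source.

```java
import java.util.regex.Pattern;

public class MatchesVsFind {
    // Whole-input match: Pattern.matches(re, str) succeeds only if the
    // entire string matches the regex.
    static boolean fullMatch(String re, String str) {
        return Pattern.matches(re, str);
    }

    // Substring search: Matcher.find() succeeds if any part of the
    // string matches the regex.
    static boolean partialMatch(String re, String str) {
        return Pattern.compile(re).matcher(str).find();
    }

    public static void main(String[] args) {
        String re = "[a-zA-Z]{3,}";
        String[] inputs = {"dog", "a dog", "10 dogs", "a bc d"};
        for (String s : inputs) {
            System.out.println("\"" + s + "\" full=" + fullMatch(re, s)
                    + " partial=" + partialMatch(re, s));
        }
    }
}
```

With the same regex, an entry such as "a dog" passes the substring search but fails the whole-input match, which is the behavior reported for Enketo versus Collect in this thread.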

wroos · January 9, 2020, 9:51am

Dear @Xiphware,
Thanks. I think we dug deep enough now.
Info might be added to Tino’s regex support article, I would like to suggest.
http://support.kobotoolbox.org/en/articles/592440-restricting-text-responses-with-regular-expressions

Best regards
Wolfgang

Xiphware · January 9, 2020, 4:11pm

We’ve identified the source of the problem in ODK javaRosa; it is being tracked here: https://github.com/opendatakit/javarosa/issues/531

The best solution needs some further discussion; specifically, just fixing the bug by using Matcher.find() instead of Pattern.matches() could result in breaking a lot of existing deployed forms (!). So an alternative may well be to introduce a new, proper XPath spec function that does the job properly. Watch this space…

In the meantime, as described above, your best bet is to construct your regexes knowing that Collect will put a ^ and $ anchor around them. So if you want a regex like '[a]+' that will behave identically under both Collect and Enketo, you could do something like:

^.*[a]+.*$
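As a rough check of this workaround, the sketch below wraps a "search" regex in ^.* and .*$ and evaluates it both ways, using the example entries from earlier in the thread. The helper names (wrap, collectStyle, enketoStyle) are hypothetical, and enketoStyle only approximates Enketo's JavaScript RegExp search with Java's Matcher.find().

```java
import java.util.regex.Pattern;

public class FullMatchWorkaround {
    // Hypothetical helper: wrap a "search" regex so it still succeeds
    // when the engine insists on matching the whole input.
    static String wrap(String searchRegex) {
        return "^.*(?:" + searchRegex + ").*$";
    }

    // Whole-input match, as reported for Collect.
    static boolean collectStyle(String re, String str) {
        return Pattern.matches(re, str);
    }

    // Substring search, approximating the behavior reported for Enketo.
    static boolean enketoStyle(String re, String str) {
        return Pattern.compile(re).matcher(str).find();
    }

    public static void main(String[] args) {
        String re = wrap("[a-zA-Z]{3,}");
        String[] inputs = {
            "a dog",
            "I saw 10 dogs, after getting-up, around 10:30... But camels = 0.",
            " a bc d", "?", "101"
        };
        for (String s : inputs) {
            System.out.println("\"" + s + "\" collect=" + collectStyle(re, s)
                    + " enketo=" + enketoStyle(re, s));
        }
    }
}
```

One caveat: in Java, "." does not match line terminators by default, so a response containing line breaks may need a different wrapper.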

tinok · January 15, 2020, 5:45pm

Thanks for digging into this and filing the issue on the JavaRosa side, @Xiphware! I guess from your findings Collect gives the same results whether a user specifies regex(.,'[0-9]{9}') or regex(.,'^[0-9]{9}$')?

I must admit I never noticed the difference either because in our work we always test for matches against the entire string, not partials. In the examples on our help page we list regex code such as regex(.,'^...$') to imply ‘match this whole expression’.

I’ve updated the help article to include the following:

Note that the Collect app will always check whether the entire text provided in a response matches your regular expression. This is equal to writing ^...$ around the regex code. Enketo Express will instead treat the regular expression literally and search for any match within the text. See this example for the regular expression regex(.,'[0-9]{9}'):

KoBoCollect: The validation is successful and the enumerator can continue to the next question if the response is exactly a 9 digit number
Enketo: The response is valid if it contains a 9 digit number. To get the same result as in KoBoCollect simply write regex(.,'^[0-9]{9}$') to indicate that the whole string must be matched.
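The 9-digit example can be checked with a small Java sketch. The names NineDigits, wholeString and anywhere are illustrative only, and anywhere() stands in for Enketo's search behavior using Java's Matcher.find() rather than the actual JavaScript engine.

```java
import java.util.regex.Pattern;

public class NineDigits {
    // Whole-input match (the Collect behavior described above).
    static boolean wholeString(String re, String s) {
        return Pattern.matches(re, s);
    }

    // Search for a match anywhere in the input (the Enketo behavior
    // described above, approximated with Matcher.find()).
    static boolean anywhere(String re, String s) {
        return Pattern.compile(re).matcher(s).find();
    }

    public static void main(String[] args) {
        String[] inputs = {"123456789", "1234567890", "tel 123456789"};
        for (String s : inputs) {
            System.out.println("\"" + s + "\""
                    + " whole[unanchored]=" + wholeString("[0-9]{9}", s)
                    + " anywhere[unanchored]=" + anywhere("[0-9]{9}", s)
                    + " anywhere[anchored]=" + anywhere("^[0-9]{9}$", s));
        }
    }
}
```

For these inputs the anchored search agrees with the unanchored whole-string match, which is why writing the ^ and $ explicitly makes a form behave the same in both clients.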

wroos · January 15, 2020, 10:45pm

Dear Tinok,
thanks!

Yes, I can confirm this from all our tests. Even just regex(.,‘a+’) shows the difference, as KoBoCollect will only allow the sense of regex(.,‘^a+$’), i.e. only an ‘a’ or word of 'a’s like ‘aaaa’ as entry.

We started with regex to make sure that flexible "other" text entries contain at least one alphabetic word of 3 letters. (Maybe instead of regex we will use the contains function.)
Kind regards

wroos · February 17, 2020, 6:15pm

GitHub news to find here