Alastair’s Place

Software development, Cocoa, Objective-C, life. Stuff like that.

Named Groups and Conditionals in ICU Regexps

Since my original post on this topic, I've done a bit more work on the ICU regexp engine. It now supports named groups and conditional expressions. The condition on which to branch can be a group number, a group name, or a lookahead or lookbehind expression.

The syntax supported by this patch now includes:

(?P<name>pattern)
Named capture group. The group can be accessed via the overloaded group(), start() and end() methods from C++, or by using the uregex_groupIndexFromName() function from C.
(?P=name)
Named backreference.
(?(name-or-id)then-part|else-part)
Conditional expression. If the group identified by the numeric ID or name has been matched, a match is attempted against then-part; otherwise, against else-part. The else-part is optional; if it is omitted, the '|' should be omitted too.
(?(?expr)then-part|else-part)
Conditional expression using a lookahead or lookbehind expression in place of ?expr. A match is first attempted with the lookahead or lookbehind expression, and its result (true or false) determines whether the then-part or the else-part is executed.
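
To get a feel for the first three constructs without building a patched ICU, here is a minimal sketch using Python's re module, which happens to accept the same (?P<name>...), (?P=name) and (?(name-or-id)then-part|else-part) syntax. This is only an illustration of the syntax and is not the ICU API; the patterns and variable names are made up for the example. Python's re does not support the lookahead/lookbehind form of the condition, so that one is not shown.

```python
import re

# Named capture group plus named backreference: find a doubled word.
doubled = re.compile(r"\b(?P<word>\w+)\s+(?P=word)\b")
m = doubled.search("this is is a test")
word = m.group("word")                    # text captured by the named group
span = (m.start("word"), m.end("word"))   # offsets of the named group

# Conditional on a named group with no else-part (so no '|' either):
# a closing '>' is required only if the optional opening '<' matched.
bracketed = re.compile(r"^(?P<open><)?(?P<body>\w+)(?(open)>)$")
results = {s: bool(bracketed.match(s)) for s in ["<tag>", "tag", "<tag", "tag>"]}

# The same condition, identified by group number instead of name.
by_number = re.compile(r"^(<)?(\w+)(?(1)>)$")
numbered_ok = bool(by_number.match("<tag>"))

print(word, span, results, numbered_ok)
```

Here "<tag>" and "tag" are accepted, while "<tag" and "tag>" are rejected, because the '>' is demanded exactly when the '<' group took part in the match.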

This version of the patch also fixes a couple of bugs in the previous version, simplifies the implementation of named capture groups (and at the same time disallows multiple capture groups with the same name), and adds quite a few new tests to the intltest and cintltest test suites to verify the operation of the new code.

It also adds the C functions uregex_namedGroupName() and uregex_namedGroupCount(), and the C++ method getGroupNames(), which provide a means for C and C++ code to obtain the names of the named capture groups for a particular RegexPattern object.
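
The signatures of the new ICU functions aren't spelled out here, so the following does not call them. As a rough analogue only, this sketch shows the same idea (asking a compiled pattern for its group names, and rejecting duplicate names) using Python's re module; the pattern itself is an invented example.

```python
import re

pattern = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(\d{2})")

# Map each named group's number to its name; unnamed groups do not appear.
names_by_index = {index: name for name, index in pattern.groupindex.items()}
named_count = len(pattern.groupindex)   # number of named groups
total_groups = pattern.groups           # named and unnamed groups together

# Two groups with the same name are rejected when the pattern is compiled.
try:
    re.compile(r"(?P<a>x)(?P<a>y)")
    duplicate_rejected = False
except re.error:
    duplicate_rejected = True

print(names_by_index, named_count, total_groups, duplicate_rejected)
```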