Alastair’s Place

Software development, Cocoa, Objective-C, life. Stuff like that.

Named Groups in ICU Regular Expressions

Python is a pretty neat language, and one of the best things about it is its runtime library. The only real downside is that Python itself isn't very fast, but of course you can work around this by making good use of the runtime library (much of which is implemented in C), or, if all else fails, by writing your own extension module. (This is hardly a new situation; in the 1980s many of us used to write high-performance code in assembly language so we could use it from our BASIC or even C programs. Some versions of BASIC even had integrated support for assembly language; this was true, for instance, of BBC BASIC on the Acorn machines and GFA BASIC on the Atari platform.)

Anyway, one of the ways that Python programmers, like Perl and Ruby programmers, optimise their code is to make heavy use of regular expressions. The regular expression engines in these languages are fast, and it's often much quicker to match a single pre-compiled regexp than it is to write Python code to scan a string. With that in mind, Python has a feature called “named capture groups”; the Python library documentation says:

(?P<name>...)

Similar to regular parentheses, but the substring matched by the group is accessible via the symbolic group name name. Group names must be valid Python identifiers, and each group name must be defined only once within a regular expression. A symbolic group is also a numbered group, just as if the group were not named. So the group named 'id' in the example above can also be referenced as the numbered group 1.

For example, if the pattern is (?P<id>[a-zA-Z_]\w*), the group can be referenced by its name in arguments to methods of match objects, such as m.group('id') or m.end('id'), and also by name in pattern text (for example, (?P=id)) and replacement text (such as \g<id>).

(?P=name)

Matches whatever text was matched by the earlier group named name.

This is a great feature, and it makes it much easier to write complicated regular expressions: you can explicitly name the groups you want to extract, and you no longer have to worry about the indices all changing when you add that one extra capture group you need for whatever it is you're doing.
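To make that concrete, here is a minimal sketch using Python's standard re module. The date pattern and the group names (sign, year, month, day, word) are made up for illustration; they don't come from any real code of mine.

```python
import re

# Illustrative pattern: the group names are assumptions made for this sketch.
DATE = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

m = DATE.match("2006-03-14")
year_by_name = m.group("year")   # look the group up by name...
year_by_index = m.group(1)       # ...or by its number; a named group is also numbered

# Add one extra group at the front: every index shifts, but the names do not.
SIGNED_DATE = re.compile(
    r"(?P<sign>[+-])?(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
)
m2 = SIGNED_DATE.match("2006-03-14")
shifted_first_group = m2.group(1)     # now the optional sign, which did not match
stable_year = m2.group("year")        # still the year

# (?P=word) refers back to the named group inside the pattern, and \g<word>
# refers to it in the replacement text.
DOUBLED_WORD = re.compile(r"\b(?P<word>\w+) (?P=word)\b")
deduplicated = DOUBLED_WORD.sub(r"\g<word>", "it is is a test")
```

Code that reads m.group("year") keeps working after the extra group is added, whereas code that reads m.group(1) silently starts returning something else.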

However, and here's the problem, it's non-standard. Python supports this syntax, but Ruby and Perl don't, and since the regular expression syntax of the ICU library was derived from Perl's, ICU doesn't either.

Anyway, to cut a long story short, I want to use ICU with Python, and part of that means I'd like regular expressions to work consistently. What I don't want is to find that my regexp matches using one API but not the other because, for example, the Unicode character databases differ, and then have things break. So I've written a patch for ICU 3.4 that adds support for Python-syntax named capture groups.

Interestingly, Oniguruma, the regexp library used by Ruby, does support named groups, though with a different syntax from Python's. I could have implemented that as well (indeed, I think that with my patch it can be done just by changing regexcst.txt and regenerating the associated header file). However, the Oniguruma syntax looks like the kind of thing that might clash with features in future versions of Perl: it uses (?<name>...) and \k<name>, whereas Python put all of its additions inside (?P...) to avoid clashes.
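The two syntaxes differ only in spelling, which a rough sketch in Python can show. The helper onig_to_python below is hypothetical and deliberately naive: it is a plain textual substitution that ignores escaped parentheses and character classes, and it has nothing to do with how the ICU patch itself works.

```python
import re

# Oniguruma-style group opener, e.g. (?<word>. Requiring an identifier after
# "<" leaves the lookbehind forms (?<= and (?<! untouched.
_ONIG_GROUP = re.compile(r"\(\?<([A-Za-z_]\w*)>")
# Oniguruma-style named backreference, e.g. \k<word>
_ONIG_BACKREF = re.compile(r"\\k<([A-Za-z_]\w*)>")


def onig_to_python(pattern: str) -> str:
    """Naively rewrite (?<name>...) and \\k<name> into Python's (?P...) forms.

    Hypothetical helper for illustration only; it does not parse the pattern,
    so escaped parentheses and character classes are not handled.
    """
    pattern = _ONIG_GROUP.sub(r"(?P<\1>", pattern)
    return _ONIG_BACKREF.sub(r"(?P=\1)", pattern)


converted = onig_to_python(r"(?<word>\w+) \k<word>")
match = re.compile(converted).search("is is")
```

Everything Python added lives behind the (?P prefix, which is what keeps it out of the way of anything Perl might later do with (?<.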

Update: a new version of this patch is available.