Alastair’s Place

Software development, Cocoa, Objective-C, life. Stuff like that.

Character Encoding and MovableType

I’m beginning to suspect that MovableType has a bug in it relating to character encoding, though the “smart” behaviour of some web browsers makes it very difficult to tell where the problem actually lies.

What I’m seeing is that the characters are encoded incorrectly on the index pages, but not on the articles’ individual pages. At least, I think that’s what I’m seeing.

Permalinks With Dashes

Hopefully, permalinks should now use dashes rather than underscores too.

New Look (Take 2)

Once I’d worked out that a backup script had managed to overwrite some of the new MovableType files with copies of the previous version, things seemed to sort themselves out :–)

The new site design has some interesting features, in particular:

  • Live comment preview.
  • Live search.

Both of these are implemented using Bob Ippolito’s MochiKit, which I’m now a big fan of; it makes writing JavaScript a lot less about tediously hitting the differences between the various implementations and a lot more about getting on with what it was you wanted to do.

New Look!

I’ve finally updated the site templates for my blog, and also updated to the latest MovableType at the same time.

There seems to be a character encoding issue somewhere or other, and I’m having trouble with comments too :–(

Named Groups and Conditionals in ICU Regexps

Since my original post on this topic, I’ve done a bit more work on the ICU regexp engine. It now supports named groups and conditional expressions; the condition on which to branch can be a group number, a group name, or a lookahead or lookbehind expression.

The syntax supported by this patch now includes:

(?P<name>)
Named capture group. Can be accessed via overloaded group(), start() and end() methods from C++, or using the uregex_groupIndexFromName() function from C.
(?P=name)
Named backreference.
(?(name-or-id)then-part|else-part)
Conditional expression; if the group identified by the numeric ID or name has been matched, then attempt a match against then-part, otherwise against else-part. The else-part is optional, in which case the ’|’ should also be omitted.
(?(?expr)then-part|else-part)
Conditional expression using a lookahead or lookbehind expression in place of ?expr. In this case, a match is attempted with the lookahead or lookbehind expression, and the result (true or false) used to choose whether to execute the then-part or the else-part as appropriate.
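Python’s own re module accepts the first three of these forms, so a minimal sketch of how they behave can be run there. Note that this uses Python rather than the patched ICU engine, that the patterns and sample strings are made up for illustration, and that the lookahead/lookbehind form of the conditional is left out because Python’s re module does not support it.

```python
import re

# (?P<name>...) defines a named group; (?P=name) is a backreference to it.
doubled = re.compile(r"(?P<word>\w+) (?P=word)")
m = doubled.search("it was was raining")
doubled_word = m.group("word")       # the named group is also group 1

# (?(name)then-part): the then-part is tried only if the group matched.
# Here a closing '>' is required exactly when an opening '<' was seen.
# The else-part is omitted, so the '|' is omitted too.
bracketed = re.compile(r"(?P<open><)?\w+(?(open)>)$")
results = [bool(bracketed.match(s)) for s in ("<tag>", "tag", "<tag")]

print(doubled_word, results)
```

The last sample string fails to match because the opening bracket was captured, so the conditional insists on a closing one.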

This version of the patch also fixes a couple of bugs in the previous version, simplifies the implementation of named capture groups (and at the same time disallows multiple capture groups with the same name), and adds quite a few new tests to the intltest and cintltest test suites to verify the operation of the new code.

It also adds the C functions uregex_namedGroupName() and uregex_namedGroupCount(), and the C++ method getGroupNames(), which provide a means for C and C++ code to obtain the names of the named capture groups for a particular RegexPattern object.

Mac Pro

Mmmmm… dual-core 3GHz Xeons, 1.6-2.1 times as fast as a PowerMac G5 Quad.

(MacNN are running a live feed from WWDC.)

Named Groups in ICU Regular Expressions

Python is a pretty neat language, and one of the best things about it is its runtime library. The only real downside is that Python itself isn’t really very fast, but of course you can work around this by making good use of the runtime library (much of which is implemented in C), or, if all else fails, by writing your own extension module. (This is hardly a new situation; in the 1980s many of us used to write high-performance code in assembly language so we could use it from our BASIC or even C programs; some versions of BASIC even had integrated support for assembly language… this was true, for instance, of BBC BASIC on the Acorn machines, and GFA BASIC on the Atari platform.)

Anyway, one of the ways that Python programmers, like Perl and Ruby programmers, optimise their code is to make heavy use of regular expressions. The regular expression engines in these languages are fast, and it’s often much quicker to match a single pre-compiled regexp than it is to write Python code to scan a string. That in mind, Python has a feature called “named capture groups”; the Python library documentation says:

(?P<name>...)

Similar to regular parentheses, but the substring matched by the group is accessible via the symbolic group name name. Group names must be valid Python identifiers, and each group name must be defined only once within a regular expression. A symbolic group is also a numbered group, just as if the group were not named. So the group named ‘id’ in the example above can also be referenced as the numbered group 1.

For example, if the pattern is (?P<id>[a-zA-Z_]\w*), the group can be referenced by its name in arguments to methods of match objects, such as m.group('id') or m.end('id'), and also by name in pattern text (for example, (?P=id)) and replacement text (such as \g<id>).

(?P=name)

Matches whatever text was matched by the earlier group named name.

This is a great feature, and makes it much easier to write complicated regular expressions, since you can explicitly name the groups you want to extract and then you don’t have to worry about whether or not the indices are all going to change when you add that one extra capture group that you need for whatever-it-is that you’re doing.
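To make the point about shifting indices concrete, here is a small sketch using Python’s re module. The address format and the group names are invented for the example.

```python
import re

# Original pattern: two named groups.
v1 = re.compile(r"(?P<user>\w+)@(?P<host>[\w.]+)")
# Later revision: an extra capture group is added in front, which shifts
# the numeric indices of the existing groups but not their names.
v2 = re.compile(r"(mailto:)?(?P<user>\w+)@(?P<host>[\w.]+)")

text = "mailto:alice@example.com"
m1, m2 = v1.search(text), v2.search(text)

by_name = (m1.group("host"), m2.group("host"))   # same in both versions
by_index = (m1.group(2), m2.group(2))            # differs after the change

# Named groups can also be used in replacement text via \g<name>.
swapped = v1.sub(r"\g<host>!\g<user>", "alice@example.com")

print(by_name, by_index, swapped)
```

Code that asks for group 2 silently starts returning the user name once the extra group is added, whereas code that asks for the group named host keeps working.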

However, and here’s the problem, it’s non-standard. Python supports it, but Ruby and Perl don’t support it, and, as the regular expression syntax for the ICU library was derived from Perl syntax, that doesn’t either.

Anyway, to cut a long story short, I want to use ICU with Python, and part of that means that I’d like regular expressions to work consistently (what I don’t want is to find that my regexp matches using one or other API, but then because e.g. the Unicode character databases differ, things then break). So I’ve written a patch for ICU 3.4 that adds support for Python-syntax named capture groups.

Interestingly, the regexp library used by Ruby, Oniguruma, does support named groups, though with a different syntax to Python. I could have implemented that also (indeed, I think, with my patch, it can be done by just changing regexcst.txt and regenerating the associated header file), however the Oniguruma syntax looks like the kind of thing that might clash with features in future versions of Perl (it uses (?<name>...) and \k<name>, whereas Python put all of its additions inside (?P...) to avoid clashes).

Update: a new version of this patch is available.

Microprocessor Myths

In the 4th August issue of MacUser (the U.K. magazine, not its U.S. cousin), I read yet again the assertion by a journalist, Kenny Hemphill, that “The truth is, it’s many years since the PowerPC has been able to compete with Intel or AMD processors”. It really annoys me to read this. Not only is it untrue, but it’s quite unfair to the very talented engineers who worked on (and continue to work on) the PowerPC processor ranges.

The press usually support this assertion with the argument that Apple changed microprocessor because the PowerPC couldn’t keep up. Sorry guys, but this one’s just plain wrong. Apple changed microprocessor because of the various architectures’ predictions of future performance per watt. It wasn’t that the PowerPC didn’t give the right bang for your buck, but that Intel showed them technology that ran cooler at the same performance level. And when they said that (at WWDC), Apple were talking about the Core family of processors, not any previous Intel designs.

What never ceases to amaze me is the poor standard of journalism that this constant assertion about PowerPC performance actually represents. Journalists are supposed to report facts, not total speculation, but whenever they come out with this one, they are ignoring the published performance figures for machines with the various processor architectures:

Processor         Clock Freq/GHz  SPECint2000  SPECfp2000
PowerPC G5        2.7             1706         2259
PowerPC G5        2.5             1587         2119
Pentium 4         3.8             1863         2091
Pentium 4         3.4             1705         1561
Athlon 64 FX-57   2.8             1970         2261
Athlon 64 FX-53   2.4             1700         1634
Core Duo T2600    2.16            1796         1615

According to these figures (from the SPEC website), at 2.7GHz the G5 is equivalent to a 3.4GHz Pentium 4 or a 2.4GHz Athlon 64 FX-53 for integer performance, but can keep up with a 2.8GHz Athlon 64 FX-57 or significantly outperform a 3.8GHz Pentium 4 for floating point. As for the Core Duo chip used in the new Intel Macs, well, it has a slightly higher integer performance than the fastest G5 chip, but the G5 comfortably outperforms it at floating point.
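These comparisons can be checked with a few lines of Python. The figures are the ones in the table; the dictionary layout and the helper name ratio are just for this sketch.

```python
# (SPECint2000, SPECfp2000) keyed by processor and clock in GHz,
# copied from the table above.
scores = {
    ("PowerPC G5", 2.7): (1706, 2259),
    ("Pentium 4", 3.8): (1863, 2091),
    ("Pentium 4", 3.4): (1705, 1561),
    ("Athlon 64 FX-57", 2.8): (1970, 2261),
    ("Athlon 64 FX-53", 2.4): (1700, 1634),
    ("Core Duo T2600", 2.16): (1796, 1615),
}

INT, FP = 0, 1

def ratio(other, index, base=("PowerPC G5", 2.7)):
    """Score of the 2.7GHz G5 divided by the score of `other`."""
    return scores[base][index] / scores[other][index]

# A ratio above 1 means the 2.7GHz G5 scores higher on that benchmark.
for chip in scores:
    print(chip, round(ratio(chip, INT), 2), round(ratio(chip, FP), 2))
```

A ratio close to 1 corresponds to “equivalent” or “can keep up with” in the paragraph above.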

The point is that the G5 is certainly comparable in performance terms with all but the very latest Intel and AMD chips, and in some areas, it will outperform them. The received wisdom from the popular computing press is completely wrong.

It is true that the Core 2 Duo is substantially faster. But it’s only just been released so it’s still a gross misrepresentation to point at the Core 2 Duo and claim that the PowerPC hasn’t been as fast as Intel or AMD offerings for ages.

URL Envy

For ages I’ve been envious of many other bloggers’ set-ups, but especially the nice friendly URL schemes that a lot of other people now use. So, starting from today, my website is at alastairs-place.net, rather than the previous www.alastairs-place.net, and you can navigate the articles using the form /yyyy/mm/article_name, or get an index of all of the articles for a given month with /yyyy/mm.

For instance, this article is http://alastairs-place.net/2006/07/url_envy.
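As an illustration of the scheme, a path can be picked apart with a short regular expression. This is only a sketch in Python; the pattern and the function name parse_path are assumptions made for the example, not how the site itself is implemented.

```python
import re

# /yyyy/mm for a monthly index, /yyyy/mm/article_name for an article.
PATH = re.compile(
    r"^/(?P<year>\d{4})/(?P<month>\d{2})(?:/(?P<name>[\w-]+))?/?$"
)

def parse_path(path):
    """Return (year, month, article name or None), or None if no match."""
    m = PATH.match(path)
    if m is None:
        return None
    return int(m.group("year")), int(m.group("month")), m.group("name")

print(parse_path("/2006/07/url_envy"))
print(parse_path("/2006/07"))
```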


Update 2011-10-14

Actually, since I updated to Octopress, the format is now /blog/yyyy/mm/dd/article-name. There are some redirections set up to go from the old format to the new one.

“Value Added Tax” (Aka Sales Tax)

According to the BBC today, HM Revenue & Customs statistics are expected to show that VAT fraud in the U.K. has reached record levels.

It’s hardly surprising that when you introduce any form of system that relies on the government paying back tax that has been claimed, you’re going to get a significant amount of fraud. We’ve seen it with Gordon Brown’s silly “tax credit” system, and we continue to see it with VAT.

What I don’t get is why people think we actually need VAT. Sure, it brings in a lot of tax revenue for the government, but:

  • It places a burden on business to account for and charge VAT correctly (this is very complicated; the rulebook is well over 300 pages long, and it isn’t even a complete list of all the rules that apply).
  • It artificially increases the prices of many goods, which can only be an inflationary pressure on the economy.
  • It is never adjusted. People sometimes claim that it gives the government an additional knob to twist, but it’s been 17.5% here in the U.K. for as long as I can remember.
  • It disproportionately taxes the poor. Everyone has to pay VAT, no matter how much they earn.
  • It distorts the E.U.’s single market, because the rules and rates differ from member state to member state (and there are 25 different member states, all of which have complicated VAT rules like ours; what’s more, they are all free to change them at various points during the year, and the penalties for non-compliance, as well as the time limits for compliance, vary from regime to regime… one or two countries even require you to employ some of their citizens in order to comply with their VAT regime!). The result is that the so-called “single market” is a joke; a hotchpotch of complicated rules administered on different terms by different states. You can even see that it isn’t really a single market, since you still have to fill out CN22 forms (under a different name, but it’s basically the same form) when sending goods between E.U. countries.
  • It creates difficulties for companies e.g. selling services or downloadable goods over the Internet. The E.U. insists that pretty much anything sold to its citizens should include VAT, but not everyone outside of the E.U., unsurprisingly, feels like paying tax to the European Union. Most of the large U.S. operations do, and a number of the smaller ones, but a lot simply don’t care, whereas those of us within the E.U. have to do it otherwise we’ll be prosecuted.
  • It creates additional opportunities for people to defraud companies (e.g. by quoting someone else’s VAT number—the E.U. doesn’t provide a sensible way for companies to verify that VAT numbers correspond to the identities of the people using them… you can do it, but you have to phone your local VAT administration, sit in a queue for hours on end and then read all the details out to an operator in order for them to tell you one way or another). In case you’re wondering, if you quote someone else’s VAT number, you may be breaking the law, but do HMRC care? No. They will pursue the company you defrauded for the VAT that you should have paid!
  • It creates additional opportunities for people to defraud the government.

There is no credible evidence that VAT is necessary or desirable. Many countries don’t have it, and their economies function perfectly well without it. We should forever curse the French, and in particular the late Maurice Lauré, for coming up with the idea.

What we should do is scrap VAT and raise the extra revenue through income taxes (in the U.K., that means Income Tax and Corporation Tax). They’re much fairer, since they’re calculated as a proportion of income, and (the daft tax-credit system aside) they offer fewer opportunities to defraud, as well as being simpler to account for.

OK, the U.K. government claims that VAT is good because it’s easy to collect (apparently), but what they really mean is that it’s easier for them.

(Note: I’m aware that VAT is collected differently from a traditional sales tax, before anyone points that out. Both types of sales tax, however, are a bad thing.)