Lately I’ve been working on optimizing the parser engine in tedir. It’s been going pretty well — over the last few days I’ve been able to get a 40% performance improvement (discounting I/O overhead) which brings the parse time down to around 1 second per megabyte of input for a pretty messy grammar. It’s still not as fast as the implementation I wrote for my thesis (I got it up to around 4 MB/s) but still pretty good considering that this is a generalized parser, not LL, LALR(1) or whatever.
Most of the work during parsing happens in just a handful of methods in a single class, Runtime. Optimizing tight loops like this is a lot of fun because you can learn a lot about what goes on in the VM by seeing which optimizations work as expected and which ones don’t. Sometimes I’ve made refactorings that I thought should obviously improve performance but didn’t. For instance, inlining methods by hand often doesn’t help because the VM will do that for you. On the other hand, sometimes optimizations do work as expected and that’s what makes it all worthwhile. You probably have to have tried it to know what I’m talking about.
Today, I came across an interesting example of an optimization that I was sure would give a great improvement in performance but which turned out to do exactly the opposite.
Tedir uses BitSets a lot. In particular, the methods that take most of the time during parsing do. If you’ve ever looked at the implementation of BitSet you will have noticed that it does not appear to have been, shall we say, particularly heavily optimized. At least, I got the distinct feeling that this annoying class was wasting my precious processor cycles big time. Naturally, I though I’ll write my own BitSet, FastBitSet, that’ll kick BitSet’s ass! So I did: I wrote a very simple bit set without all the bounds checks, basically BitSet with half the code or so discarded. I replaced all uses of BitSets with FastBitSets and ran the benchmarks. The result was shocking: not only was the implementation not faster, it was slower by a factor of two! I tried out various changes to the implementation but nothing helped. I got suspicious.
My colleague Kasper suggested that I should go back to java’s BitSet again and see how much time the profiler reported was spent executing methods in BitSet. Surprise: none! According to the profiler, no time at all was spent in methods in BitSet. I can only draw one conclusion from this: the VM is cheating — it must have special handling of the BitSet class. Incidentally, I work with a bunch of people that have previously worked on sun’s java VM but none of them knew anything about this optimization, at least they weren’t the ones who had implemented it. But I think the empirical evidence is pretty strong that the optimization takes place.
It’s hard to complain about an optimization that causes your code to become twice as fast without losing generality but still, I do feel kind of cheated.
Post-script: Until now my goal has been to post here roughly once a week. However, as long as the weather in Denmark is as good as it has been the last few weeks I won’t post as often. Sitting indoors in front of my computer just doesn’t have the same appeal in June as it does in January.