Category Archives: tedir

Code Generation

I’m considering reimplementing part of my Java parser library, hadrian. At the lowest level, hadrian is essentially a virtual machine with an unusual instruction set and it parses input by interpreting parser bytecodes. One thing I’ve been thinking about is replacing this simple interpreter with something a bit more clever, for instance dynamically generated java bytecode. Today I decided to play around with this approach to see how complicated that would be and how much performance there is to gain by replacing a simple bytecode interpreter loop with a dynamically generated interpreter.

Instead of actually replacing the hadrian interpreter I thought I’d do something simpler. I made a class, Handler, with one method for each letter in the alphabet, each one doing something trivial:

class Handler {
  private int value;
  public void doA() { value += 1; }
  public void doB() { value -= 1; }
  public void doC() { value += 2; }
  ...
}

The exercise was: given a piece of “code” in the form of a text string containing capital letters, how do you most efficiently “interpret” this string and call the corresponding methods on a handler object. Or, to be more specific, how do you efficiently interpret this string not just once but many many times. If this can be done efficiently then it should be possible to apply the same technique to a real bytecode interpreter that does something useful, like the one in hadrian.

The straightforward solution is to use an interpreter loop that dispatches on each letter and calls the corresponding method on the handler:

void interpret(char[] code, Handler h) {
  for (int i = 0; i < code.length; i++) {
    switch (code[i]) {
      case 'A': h.doA(); break;
      case 'B': h.doB(); break;
      case 'C': h.doC(); break;
      ...
    }
  }
}

That’s very similar to what hadrian currently has and it’s actually relatively efficient. It should be possible to do better though.

Java allows you to generate and load classes at runtime, but to be able to use this I had to write a small classfile assembler. There are frameworks out there for generating classfiles but I wanted something absolutely minimal, something that did just what I needed and nothing else. I found the classfile specification and wrote some code that generated a legal but empty class file that could be loaded dynamically. Surprisingly, it only took around 200 lines of code to get that working. Adding a simple bytecode assembler to that, so that I could generate methods, got it up to around 600 lines of code. That was still a lot less code than I had expected.
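
To give a feel for how little it takes, here is a minimal sketch of the kind of code involved. This is my reconstruction, not the actual assembler from this post: it emits a legal, empty class file for a class Empty extending java/lang/Object, using nothing but a DataOutputStream.

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

class EmptyClassWriter {
  static byte[] emit() throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bytes);
    out.writeInt(0xCAFEBABE); // magic number
    out.writeShort(0);        // minor version
    out.writeShort(49);       // major version (Java 5)
    out.writeShort(5);        // constant pool count = number of entries + 1
    out.writeByte(1); out.writeUTF("Empty");            // #1: Utf8
    out.writeByte(7); out.writeShort(1);                // #2: Class -> #1
    out.writeByte(1); out.writeUTF("java/lang/Object"); // #3: Utf8
    out.writeByte(7); out.writeShort(3);                // #4: Class -> #3
    out.writeShort(0x0021);   // access flags: public, super
    out.writeShort(2);        // this_class  = #2
    out.writeShort(4);        // super_class = #4
    out.writeShort(0);        // interfaces count
    out.writeShort(0);        // fields count
    out.writeShort(0);        // methods count
    out.writeShort(0);        // attributes count
    return bytes.toByteArray();
  }
}

A real assembler of course also has to build the constant pool incrementally and emit method bodies, which is where the remaining few hundred lines go.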

The simplest approach I could think of was to generate code that consisted of a sequence of calls to static methods, one static method for each “opcode” or in this case letter in the alphabet. Given a set of methods:

static void handleA(Handler h) { h.doA(); }
static void handleB(Handler h) { h.doB(); }
static void handleC(Handler h) { h.doC(); }

I would, for a string such as "KBQ...", generate code corresponding to:

void run(Handler h) {
  handleK(h);
  handleB(h);
  handleQ(h);
  ...
}

This means that instead of dispatching while running the code, you dispatch while generating code:

for (int i = 0; i < code.length; i++) {
  gen.aload(0); // Load first argument
  String name = "handle" + code[i];
  Method method = getDeclaredMethod(name, Handler.class);
  gen.invokestatic(method); // Call handler
}
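
To actually run the generated method, the class bytes still have to be loaded into the VM. That part is easy; here is a minimal sketch (the loader class and its name are mine, not from this post), assuming the assembler hands you a byte array:

class ByteClassLoader extends ClassLoader {
  Class<?> define(String name, byte[] classBytes) {
    // defineClass is a protected method inherited from ClassLoader
    return defineClass(name, classBytes, 0, classBytes.length);
  }
}

The loaded class can then be instantiated reflectively and its run method invoked on a Handler.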

I tried this and compared it with the simple interpreter. Perhaps not surprisingly, the generated code is quite a bit faster. On short strings, around 10-30 letters, it is around twice as fast as the interpreter. On longer strings, say 200-300 letters, it is 4-5 times faster. Nice.

I thought it might be interesting to see where the theoretical limit is for this, so the next thing I tried was to remove all overhead. Instead of generating calls, I generated code that performed the operations directly without going through any methods:

void run(Handler h) {
  h.value += 6;
  h.value -= 1;
  h.value += 9;
  ...
}

This was a bit more complicated to implement but I got it up and running and, surprisingly, it was exactly as fast as the previous experiment. Even though there were two calls for each opcode in the previous example, one to handleX and one to h.doX, the VM apparently inlines everything. This means that you can generate efficient straight-line code simply by having a static method for each bytecode, generating a sequence of calls to those methods and then leaving the rest to the VM. And generating classfiles containing straightforward code like that can be done efficiently in very few lines of java code. If I do end up reimplementing part of hadrian, I’m sure I’ll use this technique.

I also played around with various ways of adding arguments to the generated code, like:

void run(Handler h, int x, int y) {
  handleK(h, x, y);
  handleB(h, x, y);
  handleQ(h, x, y);
  ...
}

It turns out that as soon as you add arguments, each opcode becomes twice as expensive to execute. I tried the same with the interpreter and there was no notable change in performance. But it should be possible to pack whatever state you need to carry around into the single argument that is passed along, without losing performance.
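
As a concrete example, here is one way that state could be packed. This is a hypothetical sketch, not something from hadrian; all the names are made up:

// All state lives in one context object so the generated code can keep
// passing a single argument.
class State {
  Handler handler;
  int x, y; // would otherwise have been extra parameters

  State(Handler handler, int x, int y) {
    this.handler = handler;
    this.x = x;
    this.y = y;
  }
}

static void handleK(State s) { s.handler.doK(); }

// The generated method stays a sequence of one-argument calls:
// void run(State s) { handleK(s); handleB(s); ... }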

All in all I would say that this technique has a lot of potential, and I suspect it might have other applications than bytecode interpreters. I’m currently considering cleaning up the code and publishing it on google project hosting.

Update

This code now lives at googlecode, along with an example. The interface is pretty straightforward, especially for expressing bytecode. Here’s how this code:

public int call(int x) {
  while (x < 15)
    x = Main.twice(x);
  return x;
}

can be expressed directly in bytecode:

Code call = classFile.addMethod("call", new Code(file) {{
  // --- Labels ---
  Label loopStart = new Label();
  Label loopEnd = new Label();
  // --- Code ---
  bind(loopStart);
  push(15);
  iload(1);      // stack: 15, x
  isub();        // stack: 15 - x
  ifle(loopEnd); // exit when 15 - x <= 0, that is when x >= 15
  iload(1);
  invokestatic(Main.class.getDeclaredMethod("twice", Integer.TYPE));
  istore(1);
  ghoto(loopStart);
  bind(loopEnd);
  iload(1);
  ireturn();
}});
call.returnType(Integer.TYPE);
call.argumentTypes(Integer.TYPE);

The only annoyance is that some opcode names like return and goto are reserved, so I have to use names like rethurn and ghoto, following the “silent h” renaming convention.

Combinators

Over on his blog, Gilad is having fun with parser combinators in smalltalk. Through the magic of smalltalk you can define a domain-specific language and then write parsers directly in the source code. For instance, this grammar


<expression> := <letter> (<letter> | <digit>)*
              | <digit>+
              | "(" <expression> ")"

can be written in smalltalk as

expression
    ^ ((self letter), ([self letter] | [self digit]) star)
    | (self digit) plus
    | ((self delim: '('), [self expression], (self delim: ')'))

I won’t explain what it means; you should go read Gilad’s post for that.

Back before I wrote my master’s thesis (about dynamically extensible parsers) I actually wrote a simple prototype in squeak and, as you can see, smalltalk is fantastically well suited for this kind of stuff. So I thought, what the heck, maybe I’ll write a little parser combinator framework in squeak myself. So I did. Here’s a few thoughts about squeak and parser combinators.

Left recursion

Gilad mentions that you can’t express left recursive grammars directly using parser combinators. For instance, you can’t write

<expr> := <expr> * <expr>
        | <expr> / <expr>
        | <expr> + <expr>
        | <expr> - <expr>
        | <ident>

because <expr> must not occur on the far left side of a production in its own definition. But there are actually relatively few uses for left recursive productions, and rather than restructuring the grammar to avoid left recursion you can add a few combinators to the framework that handle 95% of all uses. One place where you often see left recursive grammars is lists of elements separated by a token:

<exprList> -> <exprList> "," <expr> | <expr>

This pattern is used so often that it makes sense to introduce plus and star combinators that take an argument, a separator that must occur between terms:

exprList
    ^ (self expr) star: (self delim: ',')

The other, and most complicated, use of left recursion is operator definitions, but that can also be expressed pretty directly in smalltalk with a dedicated combinator:

expr
    ^ (self ident) operators: [ :term |
        (term * term).
        (term / term).
        (term + term).
        (term - term).
    ]

This construct takes an “atomic” term, that’s the kind of term that can occur between the operators, and then defines the set of possible operators. The result is a parser that parses the left recursive expression grammar from before. You can also expand this pattern to allow precedence and associativity specifications:

expr
    ^ (self ident) operators: [ :term |
        (term * term) precedence: 1.
        (term / term) precedence: 1.
        (term + term) precedence: 2.
        (term - term) precedence: 2.
        (term = term) precedence: 3; associate: #right.
    ]

It does take a bit of doesNotUnderstand magic to implement this but the result is a mechanism that kicks ass compared to having to restructure the grammar or define a nonterminal for each level in the precedence hierarchy. Also, it can be implemented very efficiently.
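
For the curious, the standard efficient way to implement an operator table like this is precedence climbing. Here is a minimal Java sketch of the idea (not the squeak framework’s implementation; all names are mine, and it builds a parenthesized string instead of a real syntax tree to keep things short):

import java.util.HashMap;
import java.util.Map;

class OperatorParser {
  private final String[] tokens; // e.g. {"a", "+", "b", "*", "c"}
  private int pos = 0;

  // Operator precedences; lower numbers bind tighter, as above.
  private static final Map<String, Integer> PREC = new HashMap<String, Integer>();
  static {
    PREC.put("*", 1); PREC.put("/", 1);
    PREC.put("+", 2); PREC.put("-", 2);
  }

  OperatorParser(String[] tokens) { this.tokens = tokens; }

  String parse() { return parse(Integer.MAX_VALUE); }

  // Parses an expression whose operators all have precedence <= max.
  private String parse(int max) {
    String left = tokens[pos++]; // the "atomic" term, here an identifier
    while (pos < tokens.length) {
      Integer prec = PREC.get(tokens[pos]);
      if (prec == null || prec > max) break;
      String op = tokens[pos++];
      // Left associativity: the right operand must bind strictly tighter.
      String right = parse(prec - 1);
      left = "(" + left + " " + op + " " + right + ")";
    }
    return left;
  }
}

Parsing a + b * c this way yields (a + (b * c)), and each token is looked at only once, which is why a dedicated combinator can be fast.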

Squeak

It’s been a few months since I last used squeak and it seems like a lot has happened, especially on the UI side. I think squeak is an impressive system but in some areas it’s really an intensely poorly designed platform. That probably goes all the way back to the beginning of smalltalk.

The big overriding problem is, of course, that squeak insists on owning my source code. I’m used to keeping my source code in text files. I’m happy to let tools manage them for me like eclipse does but I want the raw text files somewhere so that I can create backups, put them under version control, etc. With squeak you don’t own your source code. When you enter a method it disappears into a giant soup of code. This is a problem for me in and of itself but it’s especially troubling if you’re unfortunate enough to crash the platform. That’s what happened to me: I happened to write a non-terminating method which froze the platform so that I had to put it down. That cost me all my code. No, sorry, that cost me its code. The worst part is that I know that smalltalkers think this ridiculous model is superior to allowing people to manage their own source code. Well, you’re wrong. It’s not that I like large source files, but I want to have some object somewhere outside the platform that contains my code and that I have control over. And yes I know about file out, that’s not what I’m talking about.

Another extremely frustrating issue is that squeak insists that you write correct code. In particular you’re not allowed to save a method that contains errors. I think it’s fine to notify me when I make an error. Sometimes during the process of writing code you may, say, refer to a class or variable before it has been defined or you may briefly have two variables with the same name. I don’t mind if the system tells me about that but squeak will insist that you change your code before you can save it. The code may contain errors not because it’s incorrect but because it’s incomplete. Squeak doesn’t care. I used another system like that once, the mjølner beta system, which was intensely disliked by many of us for that very reason.

This is just one instance of the platform treating you like you’re an idiot. Another instance is the messages. If you select the option to rename a method, the option to cancel isn’t labeled cancel, no, it’s labeled Forget it — do nothing — sorry I asked. Give. Me. A. Break.

All in all, using squeak again was a mixed experience. As the parser combinator code shows, smalltalk the language is immensely powerful and that part of it was really fun. But clearly the reason smalltalk hasn’t conquered the world is not just that, back in the nineties, Sun convinced clueless pointy-haired bosses that they should use an inferior language like Java instead of smalltalk. It’s been a long time since I’ve used a programming language as frustrating as Squeak smalltalk. On the positive side, though, most of the problems are relatively superficial (except for the file thing) and if they ever decide to fix them I’ll be happy to return.

Hadrian 0.2

As recently mentioned, I’ve been working on improving the performance of my parser library. Now that I’m done picking all the low-hanging fruit, performance-wise, I have decided to release the next version of hadrian, 0.2. This version contains a bunch of bugfixes and a few new features, but mainly it’s just a lot faster than 0.1.

Last time I wrote about it, throughput was around 1 MB of input per second, discounting I/O, which was already a considerable improvement over 0.1. Now it is around 1.5 MB per second. If you include I/O overhead it is around 1.2 MB per second on my machine. That’s actually faster than ANTLR, at least according to the benchmarks I used for my thesis.

Of course it doesn’t matter how fast it is. Even though it really is the world’s most ass-kickingest parser, people are unlikely to use it until I write some more documentation. Bugger.

Cheated!

Lately I’ve been working on optimizing the parser engine in tedir. It’s been going pretty well — over the last few days I’ve been able to get a 40% performance improvement (discounting I/O overhead) which brings the parse time down to around 1 second per megabyte of input for a pretty messy grammar. It’s still not as fast as the implementation I wrote for my thesis (I got it up to around 4 MB/s) but still pretty good considering that this is a generalized parser, not LL, LALR(1) or whatever.

Most of the work during parsing happens in just a handful of methods in a single class, Runtime. Optimizing tight loops like this is a lot of fun because you can learn a lot about what goes on in the VM by seeing which optimizations work as expected and which ones don’t. Sometimes I’ve made refactorings that I thought should obviously improve performance but didn’t. For instance, inlining methods by hand often doesn’t help because the VM will do that for you. On the other hand, sometimes optimizations do work as expected and that’s what makes it all worthwhile. You probably have to have tried it to know what I’m talking about.

Today, I came across an interesting example of an optimization that I was sure would give a great improvement in performance but which turned out to do exactly the opposite.

Tedir uses BitSets a lot. In particular, the methods that take most of the time during parsing do. If you’ve ever looked at the implementation of BitSet you will have noticed that it does not appear to have been, shall we say, particularly heavily optimized. At least, I got the distinct feeling that this annoying class was wasting my precious processor cycles big time. Naturally, I thought: I’ll write my own BitSet, FastBitSet, that’ll kick BitSet’s ass! So I did: I wrote a very simple bit set without all the bounds checks, basically BitSet with half the code or so discarded. I replaced all uses of BitSets with FastBitSets and ran the benchmarks. The result was shocking: not only was the implementation not faster, it was slower by a factor of two! I tried out various changes to the implementation but nothing helped. I got suspicious.
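
For reference, here is roughly what such a stripped-down bit set looks like. This is a sketch of the general idea, not the actual FastBitSet from tedir:

final class FastBitSet {
  private final long[] words;

  FastBitSet(int nbits) { words = new long[(nbits + 63) >>> 6]; }

  // No bounds checks, no dynamic resizing: just raw word arithmetic.
  void set(int i)    { words[i >>> 6] |= 1L << (i & 63); }
  void clear(int i)  { words[i >>> 6] &= ~(1L << (i & 63)); }
  boolean get(int i) { return (words[i >>> 6] & (1L << (i & 63))) != 0; }
}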

My colleague Kasper suggested that I should go back to java’s BitSet again and see how much time the profiler reported was spent executing methods in BitSet. Surprise: none! According to the profiler, no time at all was spent in methods in BitSet. I can only draw one conclusion from this: the VM is cheating — it must have special handling of the BitSet class. Incidentally, I work with a bunch of people that have previously worked on sun’s java VM but none of them knew anything about this optimization, at least they weren’t the ones who had implemented it. But I think the empirical evidence is pretty strong that the optimization takes place.

It’s hard to complain about an optimization that causes your code to become twice as fast without losing generality but still, I do feel kind of cheated.

Post-script: Until now my goal has been to post here roughly once a week. However, as long as the weather in Denmark is as good as it has been the last few weeks I won’t post as often. Sitting indoors in front of my computer just doesn’t have the same appeal in June as it does in January.

Hadrian Demo

My pet project over the last few years is a parser library for Java called hadrian. I was lucky enough to be able to write my master’s thesis about this project but that was neither the beginning nor the end of it. Since my thesis I’ve rewritten it completely and now we’re using the new library in the neptune compiler.

One thing I’ve neglected, though, is documentation. I’ll try to write a few posts about it here at some point to get started, but until then I’ve thrown together a demo applet which demonstrates some aspects of the library. The demo allows you to write a grammar and some input; it then parses the input according to the grammar and shows the result as a syntax tree and XML. It also contains a few examples to play around with.

In itself, this demonstrates two things. Firstly, that hadrian allows you to parse input according to a grammar that wasn’t generated at compile-time but constructed by the running program. Secondly, that the result of parsing the input is constructed by the framework — you don’t have to specify your own ASTs. That’s only a small part of what you can do but it’s pretty cool in itself. If I do say so myself. Which I do.

The code that does the actual work of the applet is pretty simple. Here it is (I’ll explain what it does below):

private void processInput(final String grammarString,
    final String inputString) {
  /* 1 */ final ITextDocument grammarDoc = new StringDocument(grammarString);
  /* 2 */ final MessageMonitor monitor = new MessageMonitor();
  /* 3 */ final HadrianParser parser =
      HadrianReader.getDefault().readParser(grammarDoc, monitor);
  /* 4 */ if (processMessages(monitor)) return;
  final ITextDocument inputDoc = new StringDocument(inputString);
  /* 5 */ final INode node = parser.parse(inputDoc, monitor);
  if (processMessages(monitor)) return;
  final DefaultTreeModel treeModel =
      new DefaultTreeModel(SyntaxTreeNode.create(node, null));
  /* 6 */ syntaxTree.setModel(treeModel);
  /* 7 */ syntaxXML.setText(TextUtils.toXML(node));
}

Here’s what’s going on in the code.

  1. The string containing the grammar is wrapped in a text document. A text document is pretty similar to a string except that it keeps track of where line breaks are and stuff like that. That can be very handy for instance when reporting errors in a document.
  2. We create a message monitor. A message monitor collects any messages that the system might generate while processing the grammar.
  3. We read the grammar and construct a parser. If this goes well, readParser returns a parser we can use to parse the input. If something goes wrong it reports an error to the message monitor which we…
  4. …check for and bail out if necessary.
  5. We wrap the string containing the input and parse it using the parser we just successfully constructed from the grammar.
  6. After checking that parsing went well we take the result, a syntax tree in the form of an INode, and wrap it in something that can be displayed in the UI.
  7. Finally, we also convert the syntax tree to XML and plant the result in the appropriate text box.

Grammars don’t have to be read from source files or strings; there’s also an API that allows programs to construct them directly. But it turns out to be pretty handy to be able to read them from source files.

If you want to take a look at how the applet is implemented, the source is available with the code in tedir-applet.jar. Hadrian itself lives at sourceforge, as a part of the tedir project.

Tedir semi-release

The first official tedir-related release is a reality: the current version of hadrian can now be downloaded from sourceforge. Also, javadoc has been released for all sub-projects of tedir.

I am currently working on a demo applet that lets you try hadrian without installing anything, and possibly a java web start application if I can finish it before I lose interest. Also, a tutorial is under way. And by “under way” I mean that all I have done so far is spend 5 minutes on jotting down a tentative outline.