Monthly Archives: January 2007

In Brief

A few short messages.

Completion: I’m beginning to lose interest in the API gadget so I’m thinking of removing it. If you’re using it and don’t want me to remove it, feel free to let me know.

Pattern Matching: A while back I wrote a post about pattern matching in the Scala language. I also had a discussion with Burak Emir on the Scala mailing list. The post was not too well written and I was my usual annoying self on the mailing list, so I didn’t actually manage to change anyone’s opinion on the subject. Since then, however, the Scala folks have whipped up a new paper on the subject, Matching Objects With Patterns, which should be of interest to anyone who finds that sort of thing interesting.

Monkey’s @: Back in the day when I was working on wildcards in Java, before javac was open sourced, you could introduce all the silly variable names you wanted without anyone noticing. Now anyone can rummage around in the javac source and discover stuff like the MONKEYS_AT value. Neal breaks down and explains where that came from.

That’s it. Carry on.

Code Generation

I’m considering reimplementing part of my Java parser library, hadrian. At the lowest level, hadrian is essentially a virtual machine with an unusual instruction set and it parses input by interpreting parser bytecodes. One thing I’ve been thinking about is replacing this simple interpreter with something a bit more clever, for instance dynamically generated java bytecode. Today I decided to play around with this approach to see how complicated that would be and how much performance there is to gain by replacing a simple bytecode interpreter loop with a dynamically generated interpreter.

Instead of actually replacing the hadrian interpreter I thought I’d do something simpler. I made a class, Handler, with one method for each letter in the alphabet, each one doing something trivial:

class Handler {
  private int value;
  public void doA() { value += 1; }
  public void doB() { value -= 1; }
  public void doC() { value += 2; }
  ...
}

The exercise was: given a piece of “code” in the form of a text string containing capital letters, how do you most efficiently “interpret” this string and call the corresponding methods on a handler object. Or, to be more specific, how do you efficiently interpret this string not just once but many many times. If this can be done efficiently then it should be possible to apply the same technique to a real bytecode interpreter that does something useful, like the one in hadrian.

The straightforward solution is to use an interpreter loop that dispatches on each letter and calls the corresponding method on the handler:

void interpret(char[] code, Handler h) {
  for (int i = 0; i < code.length; i++) {
    switch (code[i]) {
      case 'A': h.doA(); break;
      case 'B': h.doB(); break;
      case 'C': h.doC(); break;
      ...
    }
  }
}

That’s very similar to what hadrian currently has and it’s actually relatively efficient. It should be possible to do better though.

Java allows you to generate and load classes at runtime, but to be able to use this I had to write a small classfile assembler. There are frameworks out there for generating classfiles but I wanted something absolutely minimal, something that did just what I needed and nothing else. I found the classfile specification and wrote some code that generated a legal but empty class file that could be loaded dynamically. Surprisingly, it only took around 200 lines of code to get that working. Adding a simple bytecode assembler to that, so that I could generate methods, got it up to around 600 lines of code. That was still a lot less code than I had expected.
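
For concreteness, here is roughly the shape such a minimal assembler starts from. This is just a sketch, not hadrian’s actual code: it emits a legal but empty class file for a class named Empty and loads it dynamically.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class MinimalClassFile {
  // Assemble a minimal but legal class file: magic number, version,
  // a four-entry constant pool, access flags, and empty member tables.
  static byte[] assemble() throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bytes);
    out.writeInt(0xCAFEBABE);                           // magic
    out.writeShort(0);                                  // minor version
    out.writeShort(49);                                 // major version (Java 5)
    out.writeShort(5);                                  // constant pool count = entries + 1
    out.writeByte(1); out.writeUTF("Empty");            // #1 Utf8
    out.writeByte(7); out.writeShort(1);                // #2 Class -> #1
    out.writeByte(1); out.writeUTF("java/lang/Object"); // #3 Utf8
    out.writeByte(7); out.writeShort(3);                // #4 Class -> #3
    out.writeShort(0x0021);                             // access flags: public, super
    out.writeShort(2);                                  // this_class  -> #2
    out.writeShort(4);                                  // super_class -> #4
    out.writeShort(0);                                  // interfaces
    out.writeShort(0);                                  // fields
    out.writeShort(0);                                  // methods
    out.writeShort(0);                                  // class attributes
    return bytes.toByteArray();
  }

  public static void main(String[] args) throws Exception {
    final byte[] data = assemble();
    // Load the generated class through a throwaway class loader.
    ClassLoader loader = new ClassLoader() {
      protected Class<?> findClass(String name) {
        return defineClass(name, data, 0, data.length);
      }
    };
    Class<?> empty = loader.loadClass("Empty");
    System.out.println(empty.getName() + " extends " + empty.getSuperclass().getName());
  }
}
```

Everything past this point, the bytecode assembler for method bodies, is just more of the same: writing structures from the classfile specification with a DataOutputStream.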

The simplest approach I could think of was to generate code consisting of a sequence of calls to static methods, one static method for each “opcode”, in this case each letter of the alphabet. Given a set of methods:

static void handleA(Handler h) { h.doA(); }
static void handleB(Handler h) { h.doB(); }
static void handleC(Handler h) { h.doC(); }

I would, for a string such as "KBQ...", generate code corresponding to:

void run(Handler h) {
  handleK(h);
  handleB(h);
  handleQ(h);
  ...
}

This means that instead of dispatching while running the code, you dispatch while generating code:

for (int i = 0; i < code.length; i++) {
  gen.aload(0); // Load first argument
  String name = "handle" + code[i];
  Method method = getDeclaredMethod(name, Handler.class);
  gen.invokestatic(method); // Call handler
}

I tried this and compared it with the simple interpreter. Perhaps not surprisingly, it is quite a bit faster. On short strings, around 10-30 letters, it is around twice as fast as the interpreter. On longer strings, say 200-300 letters, it is 4-5 times faster. Nice.
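
The generated-bytecode version is too long to show here, but the dispatch-once idea can be sketched with plain reflection (a hypothetical stand-in, not what I actually generated): resolve each letter’s handler method when the “code” is compiled, so that running it is nothing but a sequence of calls.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

class Handler {
  int value;
  public void doA() { value += 1; }
  public void doB() { value -= 1; }
  public void doC() { value += 2; }
}

public class Precompiled {
  private final List<Method> steps = new ArrayList<Method>();

  // Dispatch happens here, once, while "compiling" the string.
  Precompiled(String code) throws Exception {
    for (char c : code.toCharArray())
      steps.add(Handler.class.getMethod("do" + c));
  }

  // Running is straight-line: no switch, just calls in order.
  void run(Handler h) throws Exception {
    for (Method m : steps) m.invoke(h);
  }

  public static void main(String[] args) throws Exception {
    Handler h = new Handler();
    new Precompiled("AACB").run(h); // +1 +1 +2 -1
    System.out.println(h.value);
  }
}
```

Reflective calls carry their own overhead, of course, which is exactly why generating real bytecode and letting the VM inline it wins.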

I thought it might be interesting to see where the theoretical limit was for this, so the next thing I tried was to remove all overhead. Instead of generating calls, I generated code that performed the operations directly without going through any methods:

void run(Handler h) {
  h.value += 6;
  h.value -= 1;
  h.value += 9;
  ...
}

This was a bit more complicated to implement but I got it up and running and surprisingly, it was exactly as fast as the previous experiment. Even though there were two calls for each opcode in the previous example, one to handleX and one to h.doX, the VM apparently inlines everything. This means that you can generate efficient straight-line code simply by having a static method for each bytecode, generating a sequence of calls to those methods and then leaving the rest to the VM. And generating classfiles containing straightforward code like that can be done efficiently in very few lines of java code. If I do end up reimplementing part of hadrian, I’m sure I’ll use this technique.

I also played around with various ways of adding arguments to the generated code, like:

void run(Handler h, int x, int y) {
  handleK(h, x, y);
  handleB(h, x, y);
  handleQ(h, x, y);
  ...
}

It turns out that as soon as you add arguments, each opcode becomes twice as expensive to execute. I tried the same with the interpreter and there was no notable change in performance. But it should be possible to pack any state you need to carry around into the single argument that is passed along, without losing performance.
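
A sketch of that packing trick (the names here are made up for illustration): the extra values ride along inside the single handler argument, so the generated calls keep the one-argument shape that inlines well.

```java
// Extra "arguments" travel inside the one object passed to every handler.
class State {
  int value;
  final int x, y;
  State(int x, int y) { this.x = x; this.y = y; }
}

public class PackedState {
  static void handleK(State s) { s.value += s.x; }
  static void handleB(State s) { s.value -= s.y; }

  // Stands in for the generated run(State) method: still one argument.
  static void run(State s) {
    handleK(s);
    handleB(s);
  }

  public static void main(String[] args) {
    State s = new State(10, 3);
    run(s); // +10, then -3
    System.out.println(s.value);
  }
}
```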

All in all I would say that this technique has a lot of potential, and I suspect it might have other applications than bytecode interpreters. I’m currently considering cleaning up the code and publishing it on google project hosting.

Update

This code now lives at googlecode, along with an example. The interface is pretty straightforward, especially for expressing bytecode. Here’s how this code:

public int call(int x) {
  while (x < 15)
    x = Main.twice(x);
  return x;
}

can be expressed directly in bytecode:

Code call = classFile.addMethod("call", new Code(file) {{
  // --- Labels ---
  Label loopStart = new Label();
  Label loopEnd = new Label();
  // --- Code ---
  bind(loopStart);
  push(15);
  iload(1);
  isub();
  ifle(loopEnd); // exit when 15 - x <= 0, i.e. x >= 15
  iload(1);
  invokestatic(Main.class.getDeclaredMethod("twice", Integer.TYPE));
  istore(1);
  ghoto(loopStart);
  bind(loopEnd);
  iload(1);
  ireturn();
}});
call.returnType(Integer.TYPE);
call.argumentTypes(Integer.TYPE);

The only annoyance is that some opcode names like return and goto are reserved so I have to use names like rethurn and ghoto, following the “silent h” renaming convention.

Combinators

Over on his blog, Gilad is having fun with parser combinators in smalltalk. Through the magic of smalltalk you can define a domain-specific language and then write parsers directly in the source code. For instance, this grammar


<expression> := <letter> (<letter> | <digit>)*
              | <digit>+
              | "(" <expression> ")"

can be written in smalltalk as

expression
  ^ ((self letter), ([self letter] | [self digit]) star)
  | (self digit) plus
  | ((self delim: '('), [self expression], (self delim: ')'))

I won’t explain what it means, you should go read Gilad’s post for that.

Back before I wrote my master’s thesis (about dynamically extensible parsers) I actually wrote a simple prototype in squeak and, as you can see, smalltalk is fantastically well suited for this kind of stuff. So I thought, what the heck, maybe I’ll write a little parser combinator framework in squeak myself. So I did. Here’s a few thoughts about squeak and parser combinators.

Left recursion

Gilad mentions that you can’t express left recursive grammars directly using parser combinators. For instance, you can’t write

<expr> := <expr> "*" <expr>
        | <expr> "/" <expr>
        | <expr> "+" <expr>
        | <expr> "-" <expr>
        | <ident>

because <expr> must not occur on the far left side of a production in its own definition. But there are actually relatively few uses for left recursive productions, and rather than restructuring the grammar to avoid left recursion you can add a few combinators to the framework that handle 95% of all uses. One place where you often see left recursive grammars is lists of elements separated by a token:

<list> -> <list> "," <elem> | <elem>

This pattern is used so often that it makes sense to introduce plus and star combinators that take an argument: a separator that must occur between terms:

exprList
  ^ (self expr) star: (self delim: ',')
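
For comparison, here is a hypothetical Java sketch of the same separator combinator (nothing to do with Gilad’s framework): the term ("," term)* shape, built from a term parser and applied iteratively instead of left recursively.

```java
import java.util.ArrayList;
import java.util.List;

public class SepBy {
  interface Parser {
    // Returns the position after a successful parse, or -1 on failure.
    int parse(String in, int pos, List<String> out);
  }

  // A term is a nonempty run of letters.
  static final Parser WORD = new Parser() {
    public int parse(String in, int pos, List<String> out) {
      int end = pos;
      while (end < in.length() && Character.isLetter(in.charAt(end))) end++;
      if (end == pos) return -1;
      out.add(in.substring(pos, end));
      return end;
    }
  };

  // term (sep term)*: the left recursive list rule, written as a loop.
  static Parser sepBy(final Parser term, final char sep) {
    return new Parser() {
      public int parse(String in, int pos, List<String> out) {
        pos = term.parse(in, pos, out);
        if (pos < 0) return -1;
        while (pos < in.length() && in.charAt(pos) == sep) {
          int next = term.parse(in, pos + 1, out);
          if (next < 0) return pos; // trailing separator: stop before it
          pos = next;
        }
        return pos;
      }
    };
  }

  public static void main(String[] args) {
    List<String> terms = new ArrayList<String>();
    sepBy(WORD, ',').parse("foo,bar,baz", 0, terms);
    System.out.println(terms);
  }
}
```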

The other, and most complicated, use of left recursion is operator definitions, but those can also be expressed pretty directly in smalltalk with a dedicated combinator:

expr
  ^ (self ident) operators: [ :term |
    (term * term).
    (term / term).
    (term + term).
    (term - term).
  ]

This construct takes an “atomic” term, that’s the kind of term that can occur between the operators, and then defines the set of possible operators. The result is a parser that parses the left recursive expression grammar from before. You can also expand this pattern to allow precedence and associativity specifications:

expr
  ^ (self ident) operators: [ :term |
    (term * term) precedence: 1.
    (term / term) precedence: 1.
    (term + term) precedence: 2.
    (term - term) precedence: 2.
    (term = term) precedence: 3; associate: #right.
  ]

It does take a bit of doesNotUnderstand magic to implement this but the result is a mechanism that kicks ass compared to having to restructure the grammar or define a nonterminal for each level in the precedence hierarchy. Also, it can be implemented very efficiently.
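
One efficient way to implement such an operators combinator is precedence climbing. Here is a bare-bones Java sketch of the idea (hypothetical, and unrelated to the smalltalk code above), parsing single-character atoms and printing the result as a fully parenthesized string. Note the convention is inverted relative to the smalltalk example: here a higher number binds tighter.

```java
import java.util.HashMap;
import java.util.Map;

public class Climb {
  // Operator table: higher number = binds tighter.
  static final Map<Character, Integer> PREC = new HashMap<Character, Integer>();
  static { PREC.put('*', 2); PREC.put('/', 2); PREC.put('+', 1); PREC.put('-', 1); }

  static int pos;

  // An atom is a single character, for simplicity.
  static String atom(String in) { return String.valueOf(in.charAt(pos++)); }

  // Parse expressions whose operators bind at least as tightly as min.
  static String expr(String in, int min) {
    String left = atom(in);
    while (pos < in.length()) {
      char op = in.charAt(pos);
      Integer p = PREC.get(op);
      if (p == null || p < min) break;
      pos++;
      String right = expr(in, p + 1); // p + 1 makes the operator left associative
      left = "(" + left + op + right + ")";
    }
    return left;
  }

  public static void main(String[] args) {
    pos = 0;
    System.out.println(expr("1+2*3-4", 1));
  }
}
```

The single recursive method replaces the nonterminal-per-precedence-level grammar you would otherwise have to write by hand.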

Squeak

It’s been a few months since I last used squeak and it seems like a lot has happened, especially on the UI side. I think squeak is an impressive system but in some areas it’s really an intensely poorly designed platform. That probably goes all the way back to the beginning of smalltalk.

The big overriding problem is, of course, that squeak insists on owning my source code. I’m used to keeping my source code in text files. I’m happy to let tools manage them for me like eclipse does but I want the raw text files somewhere so that I can create backups, put them under version control, etc. With squeak you don’t own your source code. When you enter a method it disappears into a giant soup of code. This is a problem for me in and of itself but it’s especially troubling if you’re unfortunate enough to crash the platform. That’s what happened to me: I happened to write a non-terminating method which froze the platform so that I had to put it down. That cost me all my code. No, sorry, that cost me its code. The worst part is that I know that smalltalkers think this ridiculous model is superior to allowing people to manage their own source code. Well, you’re wrong. It’s not that I like large source files, but I want to have some object somewhere outside the platform that contains my code and that I have control over. And yes I know about file out, that’s not what I’m talking about.

Another extremely frustrating issue is that squeak insists that you write correct code. In particular you’re not allowed to save a method that contains errors. I think it’s fine to notify me when I make an error. Sometimes during the process of writing code you may, say, refer to a class or variable before it has been defined or you may briefly have two variables with the same name. I don’t mind if the system tells me about that but squeak will insist that you change your code before you can save it. The code may contain errors not because it’s incorrect but because it’s incomplete. Squeak doesn’t care. I used another system like that once, the mjølner beta system, which was intensely disliked by many of us for that very reason.

This is just one instance of the platform treating you like you’re an idiot. Another instance is the messages. If you select the option to rename a method the option to cancel isn’t labelled cancel, no, it’s labeled Forget it — do nothing — sorry I asked. Give. Me. A. Break.

All in all, using squeak again was a mixed experience. As the parser combinator code shows smalltalk the language is immensely powerful and that part of it was really fun. But clearly the reason smalltalk hasn’t conquered the world is not just that back in the nineties, Sun convinced clueless pointy-haired bosses that they should use an inferior language like Java instead of smalltalk. It’s been a long time since I’ve used a programming language as frustrating as Squeak smalltalk. On the positive side, though, most of the problems are relatively superficial (except for the file thing) and if they ever decide to fix them I’ll be happy to return.

Final

In his recent java 7 wish list Peter Ahé attributes a proposal to me, a shorthand syntax for variable declaration. In this post I’ll give a bit of background on where that idea came from and why I think it deserves to make it into java 7.

The idea is really very simple: when you declare a final variable with an initializer you are allowed to leave out the type of the variable; if you do, the type will be inferred from the initializer expression. Peter’s example is this:

final strings = new ArrayList<String>();

The variable strings will get the type ArrayList<String> because that’s the type of the expression used to initialize it. That’s all there is to it. The reason this idea is even worth writing about is not the mechanism in itself but the fact that a number of really nice properties fall out if you link type inference and final together.

But first I have to confess that this may not have been my idea at all. All I know is that it came out of a discussion I had with Kasper Lund last winter while we worked together at Esmertec. I’ve since talked with Kasper about it and neither of us can remember who actually suggested it.

The idea came out of my code style at the time. In his (semi-)recent follow-up rant on agile programming, Steve Yegge talks about slob type systems and neat-freak type systems and the slobs and neat-freaks among programmers who prefer either one or the other. Well, I am — emphatically — a neat freak (as Erik saw fit to point out recently). If you take a look at some of my source code I’m sure you’ll agree. I need my type declarations, my @Override annotations and, more than anything, I need my final variables. I can say without blushing that I was probably the biggest fan of final variables on the planet. Unfortunately, writing final on all variables whose value doesn’t change is really verbose. Hence the idea to leave out the part of a final variable declaration that is actually redundant: the type. That is, in a nutshell, where the idea came from. But there’s more to it.

Final

Historically, the concept of local variables has been tied to mutability. It’s even part of the name: it’s called a local variable, not a local constant. It’s only natural, really: local definitions (to use a mutable/immutable neutral word) have historically been implemented as locations on the call stack, which means that it is trivial to make them mutable. Why not by default give programmers added flexibility and allow them to change them? Well, there is actually good reason not to.

What is a local “variable”? It’s a location or slot where you can store different values over time. Consider this piece of code:

String myMethod() {
  String str = "bleep";
  ...
  ... // many lines of code
  ...
  return str;
}

Say you’ve run into a NullPointerException somewhere and suspect that this method might return null, that is, that str is null when the method returns. It starts out non-null but since it’s a variable you have to read and understand, at least to some degree, all the code between the declaration and the return. Maybe there are no assignments, in which case you know the value is still "bleep" when the method returns. I would venture a guess that across all the world’s java code that is the common case: variables are usually only written once. But you don’t know until you’ve read the whole method.

Variables that are initialized and then never changed are a different beast than variables that may be written multiple times, and they are much simpler. When you look at the declaration of such a variable you know not only the type of the variable but also the value. If you knew in advance in the code above that str doesn’t change, you can safely ignore all other code in the method and focus on the value used to initialize str since you know that that’s the value that will eventually be returned. The initializer may be a complex expression, sure, but it is still a simpler problem to figure out the value of an expression than to figure out the value of an expression and figure out how and when the variable might change its value. You’re not creating a slot where you can store different values but just computing a value and then giving it name.

That’s why I started writing final on all variables whose values never changed. It makes it a lot easier to skim through code and understand what is going on. In particular, since most variables will be final, it actually makes non-final variables stand out. After I’d used this for a while I stopped noticing final, since that’s the common case, and instead started to notice non-final variables. Maybe it has never been a problem for you to distinguish between variables that change and variables that don’t but there’s no question about the fact that when you read a piece of code you have to dedicate a few mental cycles to sort this out whenever the code refers to a variable. And in my experience mental cycles is an extremely precious resource when reading code, which is why I’m such a neat-freak (well, that and plain OCD). If writing final or @Override saves cycles for future readers of my code I’ll be inclined to do it; that future reader is likely to be myself.

Writing final everywhere is verbose, though. My god is it verbose. I have always had little patience for people who think that the fewer keystrokes it takes to write a program the better. Needless to say, perl is not my favorite language. In The Mythical Man-Month, Fred Brooks says that programmers seem to generate about the same amount of code per day regardless of the language they use (which is bad news if 25% of the code you write is the word final). Had it been someone other than Fred Brooks I’d have called it rubbish. Paul Graham says that he has seen little to contradict this hypothesis. I have no problem calling that rubbish. The logic and abstractions used in a program, those take time to design and understand. Sitting down in front of a keyboard and pushing buttons isn’t what takes the time.

Having said that, though, there’s such a thing as ad absurdum, and so a few months ago I decided that it was enough with the final. It makes the code easier to read, sure, but I’m afraid if I kept it up I’d have arthritis by the time I was 30. It was quite a relief and I doubt I’ll ever revert to my old coding style. But it is frustrating: I felt I was forced to do the Wrong Thing just because the Right Thing was too verbose even for me. Bugger.

Inference

But let’s turn to another source of sore fingers: generics. A declaration like this

HashMap<String, List<String>> map = new HashMap<String, List<String>>();

is around 50% redundant. And it gets worse, much worse. So obviously people have been looking for a way to remove part of this redundancy. I’ve heard several different suggestions. You can use inference to guess the type arguments of the new expression:

HashMap<String, List<String>> map = new HashMap();

I think this proposal is a bit confusing and it has the limitation that it only really works with new expressions. On the plus side it allows you to use different types for the variable and the constructor:

Map<String, List<String>> map = new HashMap();

Another option is to use a shorthand syntax that resembles C++’s stack allocated object syntax:

HashMap<String, List<String>> map();

This syntax is also limited to object creation, it (sort of) collides with the method declaration syntax, and has the added limitation that the type of the variable must be a class type, which is unfortunate if you want to assign other values to the variable. On the other hand if the variable is immutable I actually think it’s a pretty reasonable syntax.

Finally, you can infer the variable type from the initializer:

final map = new HashMap<String, List<String>>();

or, with the syntax proposed by James Gosling,

map := new HashMap<String, List<String>>()

The advantage of this is that the syntax is not limited to variables initialized with a new expression, and the type of the variable can be an interface type. On the minus side, it can become difficult to figure out what the type of a variable is:

final obj = myMethod();

The most important thing about unifying type inference and final, though, is that it avoids a number of problems you otherwise have when mixing type inference and mutability (which is a well-known source of trouble).

A mutable variable can be assigned any number of times during its lifetime and it’s not a given that it makes sense to just use the type of the initial assignment as the type of the variable. Imagine you’re doing something like:

static Properties getProperties(String name) { ... }

props := getProperties("qbert");
if (props == null)
  props = Collections.emptyMap(); // error

This code looks pretty reasonable but it doesn’t type check because the initial assignment gives props the type Properties when the intended type was actually the less specific Map (which Properties is a subtype of). It is also fragile. You could imagine that getProperties had originally been written with the return type Map, in which case the code above would work. If you then later decided that the type could be narrowed to Properties, which seems pretty innocent, you would unexpectedly break the code.

On the other hand, if you only assign to a variable once you can never go wrong in using the type of that value as the type of the variable. If you were later to narrow the type of a variable or method used to initialize such a variable you wouldn’t risk breaking code that depends on the type not being overly specific. Also, I would say that adding another level of complexity to mutable variables is not doing anyone a favor. At least, if you were to do that it should be through a syntax that stood out more than just adding a single colon (:=) which is easily overlooked. But that’s a matter of taste more than anything.

Conclusion

I think there’s a very good chance that a variable type inference mechanism will be added in Java 7. Making it easier to declare final variables would be a good thing and linking type inference with final solves a number of problems. There’s a risk that code may become harder to read because this proposal not only allows you to leave out type declarations when they’re redundant but also when they’re not. In most languages you could just trust programmers to exercise judgment when deciding whether or not to leave out variable types. Java’s design philosophy is based on not trusting people to make that kind of decision so introducing a mechanism like this would be a bit of a departure for this language. I hope that won’t end up standing in the way of introducing this or a similar mechanism.

Completion #3

I’ve been working on the JavaDoc API gadget over Christmas. I know, it’s as if I just don’t understand the concept of “holidays”. Anyway.

The UI now works perfectly in firefox and safari (at least as far as I can tell), in opera with just a single glitch and reasonably well in IE as long as you don’t resize the browser window. I have also fixed a few bugs, including some issues with inner classes, and implemented a feature several people have been asking for: pressing enter in the text field now takes you to the top item on the list.

A less visible improvement is that a lot less data now has to be transferred to the gadget when you start to write in the text field — the class index is down from 60K to 20K in browsers that support gzipped content (which these days turns out to be pretty much any of them).

The downside is that the code is now pretty atrocious. I quickly realized that having just one HTML document and one javascript file for all the different browsers was no good. Instead, I now generate different files for each of firefox, safari, opera and explorer. In fact the files are being generated using the most stone-age technique imaginable: cpp. Here’s an actual piece of code from the implementation:

function KeyEventMatches(e, r) {
#ifdef EXPLORER
  var c = window.event.keyCode;
#else
  var c = e.which;
#endif
  return String.fromCharCode(c).match(r);
}

I know, absolutely terrible but it actually works pretty well and, most importantly, requires no resources at runtime.

Oh, and I’ve moved the gadget here:


Add to Google

The old location will disappear shortly.

I hope to clear up the last outstanding bugs, especially the thing with resizing in explorer, this week.