Category Archives: java

Finally

Before I had used C++ I had quite a strong dislike for it. I remember wondering why anyone would voluntarily use C++, and being frustrated when reading interviews with Bjarne Stroustrup because he wouldn’t admit that C++, or at least parts of it, sucks.

Well, for the last few years I’ve actually been using C++ and have ended up changing my mind completely. As I expect most C++ programmers do, I stay within a subset of the language, but that subset is now one of my favorite languages: not too complicated, powerful and concise. With the other languages I use regularly, Java, Python and JavaScript, I often find myself thinking “man, what were they thinking when they designed this feature?”. That practically never happens with C++. It can be complex and confusing, but when taking the constraints and interactions with other language features into account I can’t think of a single instance where I’ve been able to come up with a better solution. I realize that is quite an arrogant way to put it but look at it this way: coming from an arrogant and opinionated person it is a huge compliment.

One of my favorite features is stack-allocated objects because they enable the resource acquisition is initialization (RAII) pattern. Given a choice between doing resource handling in Java, using finalizers for objects and finally clauses for scopes, and C++ using RAII, I would choose C++ any day.

The weaknesses of finalizers are well-known so I won’t say more about them. The problem with finally clauses is closely related to the subject of making wrong code look wrong. Finally clauses divorce the code that releases a resource from the code that allocates it, which makes it much harder to see when you forget to clean up.

FileReader in = new FileReader(fileName);
try {
  /* process file */
} finally {
  in.close();
}

Using a finally clause means that the code that cleans up in sits on the other side of the code that uses it. You have to scan the whole method to convince yourself that in will indeed be closed. An even worse issue is the fact that the try part has a scope of its own that the finally clause can’t see. This means that anything you do after having allocated a resource will, by default, be hidden from the finally clause. Because of this the pattern above doesn’t scale if you want to open two files:

FileReader in = new FileReader(inFileName);
FileWriter out = new FileWriter(outFileName);
try {
  /* process files */
} finally {
  in.close();
  out.close();
}

If the second constructor throws an exception, and you can’t be sure it doesn’t, then the finally clause is not yet in effect and so in won’t be cleaned up. You can’t declare the out variable in the try clause because then the finally clause can’t see it so what people often end up doing is:

FileReader in = null;
FileWriter out = null;
try {
  in = new FileReader(inFileName);
  out = new FileWriter(outFileName);
  /* process files */
} finally {
  if (in != null) in.close();
  if (out != null) out.close();[1]
}

What we’re trying to do here is really very simple: we want to be sure that these files are closed before leaving the scope. With try/finally the solution is more complicated than the problem.
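For the record, a version that is actually robust, including against the close-throws problem mentioned in the footnote, needs nested try/finally blocks, which only underlines the point:

FileReader in = new FileReader(inFileName);
try {
  FileWriter out = new FileWriter(outFileName);
  try {
    /* process files */
  } finally {
    out.close();
  }
} finally {
  in.close();
}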

In C++ the destructors of stack-allocated objects give you a way to execute code exactly when you leave a scope. Here’s how you could express the same thing in C++:

own_resource in_resource = fopen(in_file_name, "r");
FILE* in = *in_resource;
if (in == NULL) return;
own_resource out_resource = fopen(out_file_name, "w");
FILE* out = *out_resource;
if (out == NULL) return;
/* process files */

The resources we acquire here are stored in a stack-allocated own_resource; the destructor of the own_resource class takes care of releasing the value stored in it. The language ensures that no matter how control leaves the current scope the own_resource destructors will be called first. Furthermore I can see right at the call to fopen that the result will be cleaned up. You don’t have to use this pattern long before any call that allocates a resource and does not store it in an own_resource (or an own_ptr or some other own_ object) looks very wrong.

How to manage resources is closely related to how to signal errors. Most new languages use some form of exceptions, even though they have acquired something of a bad name. There are many problems with exceptions (none of which can be solved with error codes by the way) but being able to specify how a resource should be cleaned up at the same time as you allocate it does solve many of those problems. Getting this right requires you to assume that any part of your code could throw an exception. It’s a slightly nicer version of preemption really: someone might interrupt you at any nontrivial operation but unlike preemption all you must be able to do is bail out, you don’t have to be able to continue running. That is doable. Furthermore it’s just a fact of life: whether you use exceptions or not you may have to bail out at any nontrivial operation, because of stack overflow, memory exhaustion, SIGINT, or whatever. The best way to defend yourself against that is to program defensively, and to write your code in a style where resource allocation operations stand out if they don’t immediately make provisions for deallocation. I don’t know a language that does that better than C++.

However, as much as I’ve been praising C++, its solution relies much too heavily on static typing for my taste. An interesting challenge is to design a way for flexible object-oriented RAII to work in dynamically typed languages. That is something I will be working on and I hope I’ll eventually be back here with a good solution.


1: As Bob points out in the comments there’s a bug here: if in.close() throws an exception then out will not be closed. It’s another illustration of how hard this stuff is to get right. Note that the C++ example is not affected by this issue: if the destructor for out (which is the first to be called) throws an exception the same rules apply with regards to in: the exception causes us to leave its scope so its destructor will be called.

JavaFX

Whaddayaknow, Sun has created a new “family of products”, JavaFX. This thing promises to “simplify and speed the creation and deployment of high-impact content for a wide range of devices”. The benefits are that it “reduces integration costs, improves device software consistency, and enables handset manufacturers to provide new offerings with substantially faster time-to-market”. You have to hand it to Sun’s marketing people: that’s top grade bullshit.

But if you look beyond the marketing crap, one member of this “family of products” is a brand new programming language, JavaFX Script. Not a lot has been written about it yet but I did manage to find this article, which gives a quick introduction.

It is not a general-purpose language but:

[…] designed to optimize the creative process of building rich and compelling UIs leveraging Java Swing, Java 2D and Java 3D for developers and content authors.

Behind the superficial similarities with Java and JavaScript this new language has some very interesting features of its own. Here are a few of the features that caught my interest.

It is an object-oriented language (it runs on the JVM) but not everything is an object.

Arrays represent sequences of objects. In JavaFX arrays are not themselves objects, however, and do not nest. Expressions that produce nested arrays […] are automatically flattened […]

Hmm…

Arrays have a special status in the JavaFX language and are supported by some special syntactic constructs, for instance

var ints = [1, 2, 3, 4, 5];
insert 10 after ints[. == 3]
// ints is now [1, 2, 3, 10, 4, 5]

Note that the language is not dynamically typed, as the var declaration might seem to suggest; the ints variable is statically inferred to be an array of integers. Also note the use of a predicate to identify an element of an array. This is pretty neat:

var smallInts = ints[. < 4];
// smallInts is now [1, 2, 3]

JavaFX distinguishes between procedures and pure functions. There is a special syntax for declaring a pure function:

function dist(x, y) {
  var xSqr = x * x;
  var ySqr = y * y;
  return sqrt(xSqr + ySqr);
}

A pure function declaration must be a sequence of variable declarations and a single final return statement. I guess they check statically that all functions called from within a pure function are also pure. If a procedure has side effects it must be declared with an operation declaration.

There are list comprehensions (array comprehensions?) with a relatively standard syntax:

select n * n from n in [1 .. 100]

Is it really so hard to find a list comprehension syntax where the declaration of the variable doesn’t come after its use? Apparently…

While the syntax is Java-like there are some syntactic differences from other languages in the C family. For logical operators they use the keywords and, or and not rather than &&, || and !. They also don’t mandate parentheses around conditions in if and while statements. Those are great changes, especially the logical operator keywords; this,

not (a == null or a.isEmpty())

reads much better than

!(a == null || a.isEmpty())

Also, if the precedence is right, it means that you can write

if (not x instanceof Foo) ...

rather than

if (!(x instanceof Foo)) ...

Another great feature is string interpolation; you can write stuff like

var answer = true;
var str = "The answer is {if answer then "Yes" else "No"}.";
// str is now "The answer is Yes."

I’ve used string interpolation in various languages and it’s great to have.

There is a reflective interface that looks a lot like Java’s except that it looks more straightforward and easier to use. There are a few extensions; a particularly spicy one is an operation that yields the extent of a class. That’s right, the syntax *:Foo yields an array of all instances of class Foo:

// Print all strings in this program
for (str in *:String)
  System.out.println(str);

The clever reader might wonder whether the result of *:Array contains itself, since the result is an array. The really clever reader remembers that arrays aren’t objects and so can’t be enumerated by *:.

Why any language should have such a feature is a mystery to me and, it seems, also to them:

Note: this is an optional feature and is disabled by default.

Hopefully they will come to their senses and remove it completely before it’s too late and they have to maintain an implementation.

There are many more interesting features in this language which you can read about in the language reference. An implementation can be downloaded from their website; it runs on the JVM.

Overall it looks like a language with its very own peculiar flavor, not quite like any language I know. It’ll be interesting to see how well it succeeds in attracting programmers.

Final

In his recent java 7 wish list Peter Ahé attributes a proposal to me, a shorthand syntax for variable declaration. In this post I’ll give a bit of background on where that idea came from and why I think it deserves to make it into java 7.

The idea is really very simple: when you declare a final variable with an initializer you are allowed to leave out the type of the variable; if you do, the type will be inferred from the initializer expression. Peter’s example is this:

final strings = new ArrayList<String>();

The variable strings will get the type ArrayList<String> because that’s the type of the expression used to initialize it. That’s all there is to it. The reason this idea is even worth writing about is not the mechanism in itself but the fact that a number of really nice properties fall out if you link type inference and final together.

But first I have to confess that this may not have been my idea at all. All I know is that it came out of a discussion I had with Kasper Lund last winter while we worked together at Esmertec. I’ve since talked with Kasper about it and neither of us can remember who actually suggested it.

The idea came out of my code style at the time. In his (semi-)recent follow-up rant on agile programming, Steve Yegge talks about slob type systems and neat-freak type systems and the slobs and neat-freaks among programmers who prefer either one or the other. Well, I am — emphatically — a neat freak (as Erik saw fit to point out recently). If you take a look at some of my source code I’m sure you’ll agree. I need my type declarations, my @Override annotations and, more than anything, I need my final variables. I can say without blushing that I was probably the biggest fan of final variables on the planet. Unfortunately, writing final on all variables whose value doesn’t change is really verbose. Hence the idea to leave out the part of a final variable declaration that is actually redundant: the type. That is, in a nutshell, where the idea came from. But there’s more to it.

Final

Historically, the concept of local variables has been tied to mutability. It’s even part of the name: it’s called a local variable, not a local constant. It’s only natural, really: local definitions (to use a mutability-neutral word) have historically been implemented as locations on the call stack, which makes it trivial to make them mutable. Why not give programmers added flexibility by default and allow them to change them? Well, there is actually good reason not to.

What is a local “variable”? It’s a location or slot where you can store different values over time. Consider this piece of code:

String myMethod() {
  String str = "bleep";
  ...
  ... // many lines of code
  ...
  return str;
}

Say you’ve run into a NullPointerException somewhere and suspect that this method might return null, that is, that str is null when the method returns. It starts out non-null but since it’s a variable you have to read and understand, at least to some degree, all the code between the declaration and the return. Maybe there are no assignments, in which case you know the value is still "bleep" when the method returns. I would venture a guess that across all the world’s java code that is the common case: variables are usually only written once. But you don’t know until you’ve read the whole method.

Variables that are initialized and then never changed are a different beast than variables that may be written multiple times, and they are much simpler. When you look at the declaration of such a variable you know not only the type of the variable but also the value. If you know in advance in the code above that str doesn’t change, you can safely ignore all other code in the method and focus on the value used to initialize str, since you know that that’s the value that will eventually be returned. The initializer may be a complex expression, sure, but it is still a simpler problem to figure out the value of an expression than to figure out the value of an expression and figure out how and when the variable might change its value. You’re not creating a slot where you can store different values but just computing a value and then giving it a name.
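With final the same method answers the question right at the declaration:

String myMethod() {
  final String str = "bleep";
  ...
  ... // many lines of code
  ...
  return str; // still "bleep": no reassignment is possible
}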

That’s why I started writing final on all variables whose values never changed. It makes it a lot easier to skim through code and understand what is going on. In particular, since most variables will be final, it actually makes non-final variables stand out. After I’d used this for a while I stopped noticing final, since that’s the common case, and instead started to notice non-final variables. Maybe it has never been a problem for you to distinguish between variables that change and variables that don’t but there’s no question about the fact that when you read a piece of code you have to dedicate a few mental cycles to sorting this out whenever the code refers to a variable. And in my experience mental cycles are an extremely precious resource when reading code, which is why I’m such a neat-freak (well, that and plain OCD). If writing final or @Override saves cycles for future readers of my code I’ll be inclined to do it; that future reader is likely to be myself.

Writing final everywhere is verbose, though. My god is it verbose. I have always had little patience for people who think that the fewer keystrokes it takes to write a program the better. Needless to say, perl is not my favorite language. In the mythical man month, Fred Brooks says that programmers seemed to generate about the same amount of code per day regardless of the language they used (which is bad news if 25% of the code you write is the word final). Had it been someone other than Fred Brooks I’d have called it rubbish. Paul Graham says that he had seen little to contradict this hypothesis. I have no problem calling that rubbish. The logic and abstractions used in a program, those take time to design and understand. Sitting down in front of a keyboard and pushing buttons isn’t what takes the time.

Having said that, though, there’s such a thing as ad absurdum, and so a few months ago I decided that it was enough with the final. It makes the code easier to read, sure, but I’m afraid if I kept it up I’d have arthritis by the time I was 30. It was quite a relief and I doubt I’ll ever revert to my old coding style. But it is frustrating: I felt I was forced to do the Wrong Thing just because the Right Thing was too verbose even for me. Bugger.

Inference

But let’s turn to another source of sore fingers: generics. A declaration like this

HashMap<String, List<String>> map = new HashMap<String, List<String>>();

is around 50% redundant. And it gets worse, much worse. So obviously people have been looking for a way to remove part of this redundancy. I’ve heard several different suggestions. You can use inference to guess the type arguments of the new expression:

HashMap<String, List<String>> map = new HashMap();

I think this proposal is a bit confusing and it has the limitation that it only really works with new expressions. On the plus side it allows you to use different types for the variable and the constructor:

Map<String, List<String>> map = new HashMap();

Another option is to use a shorthand syntax that resembles C++’s stack allocated object syntax:

HashMap<String, List<String>> map();

This syntax is also limited to object creation, it (sort of) collides with the method declaration syntax, and has the added limitation that the type of the variable must be a class type, which is unfortunate if you want to assign other values to the variable. On the other hand if the variable is immutable I actually think it’s a pretty reasonable syntax.

Finally, you can infer the variable type from the initializer:

final map = new HashMap<String, List<String>>();

or, with the syntax proposed by James Gosling,

map := new HashMap<String, List<String>>();

The advantage of this is that the syntax is not limited to variables initialized with a new expression, and the type of the variable can be an interface type. On the minus side, it can become difficult to figure out what the type of a variable is:

final obj = myMethod();

The most important thing about unifying type inference and final, though, is that it avoids a number of problems you otherwise have when mixing type inference and mutability (which is a well-known source of trouble).

A mutable variable can be assigned any number of times during its lifetime and it’s not a given that it makes sense to just use the type of the initial assignment as the type of the variable. Imagine you’re doing something like:

static Properties getProperties(String name) { ... }

props := getProperties("qbert");
if (props == null)
  props = Collections.emptyMap(); // error

This code looks pretty reasonable but it doesn’t type check because the initial assignment gives props the type Properties when the intended type was actually the less specific Map (which Properties is a subtype of). It is also fragile. You could imagine that getProperties had originally been written with the return type Map, in which case the code above would work. If you then later decided that the type could be narrowed to Properties, which seems pretty innocent, you would unexpectedly break the code.

On the other hand, if you only assign to a variable once you can never go wrong in using the type of that value as the type of the variable. If you were later to narrow the type of a variable or method used to initialize such a variable you wouldn’t risk breaking code that depends on the type not being overly specific. Also, I would say that adding another level of complexity to mutable variables is not doing anyone a favor. At least, if you were to do that it should be through a syntax that stood out more than just adding a single colon (:=) which is easily overlooked. But that’s a matter of taste more than anything.

Conclusion

I think there’s a very good chance that a variable type inference mechanism will be added in Java 7. Making it easier to declare final variables would be a good thing and linking type inference with final solves a number of problems. There’s a risk that code may become harder to read because this proposal not only allows you to leave out type declarations when they’re redundant but also when they’re not. In most languages you could just trust programmers to exercise judgment when deciding whether or not to leave out variable types. Java’s design philosophy is based on not trusting people to make that kind of decision so introducing a mechanism like this would be a bit of a departure for this language. I hope that won’t end up standing in the way of introducing this or a similar mechanism.

Generic exceptions

This post is about a language mechanism they snuck into Java in 5.0 which I recently discovered. Well, to be honest I didn’t even discover it myself: most of what I’ll describe here was pointed out to me by Peter Ahé. But as far as I can tell it’s not widely known or used so I thought I’d take the opportunity to spread the word.

Here is a standard generic method, just to refresh your memory and get into gear:

public static <T> T getFirst(List<T> ts) {
  return ts.get(0);
}

It takes a list and returns the first element. It has one type parameter which is used both in the return type and in the argument types.

It turns out, however, that you can actually use type parameters in one more position in a method declaration: throws clauses:

public interface IRunnable<T extends Throwable> {
  public void run() throws T;
}

This interface has a single type parameter which specifies which kind of exceptions may be thrown from the run method. I’ve always just assumed that you couldn’t do that so I’ve never even tried — but it turns out that you can. As far as I can tell there are no explicit rules for this in the language specification; it’s just not disallowed. The only mention I’ve been able to find of it in the JLS is a footnote in section 8.4.6.

A concrete subclass of IRunnable could be

public class FileCreator implements IRunnable<IOException> {
  public void run() throws IOException {
    // create a file
  }
}

If you call the run method on a file creator you will be required to catch or declare the IOException, just as usual. What if my runnable doesn’t actually throw any exceptions? Well, you can just use IRunnable<RuntimeException>.
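For instance (NoopRunnable is a made-up example):

public class NoopRunnable implements IRunnable<RuntimeException> {
  public void run() {
    // nothing thrown, so callers need no try/catch and no throws clause
  }
}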

Since the exception type is just an ordinary type parameter, you can do all the complex stuff you can usually do with type parameters. You can leave them out:

IRunnable rawRunnable = ...;
rawRunnable.run(); // throws Throwable

and you can use wildcards

IRunnable<? extends DebugException> debugRunnable = ...;
debugRunnable.run(); // throws DebugExceptions

There is one thing you can’t do, though: throw multiple exceptions:

public class MyRunnable implements IRunnable<...> {
  public void run() throws IOException, IllegalArgumentException {
    // ...
  }
}

As far as I can tell, there is no type argument you can give to IRunnable that would allow run to throw more than one exception type.

In my checked exceptions RFE, one argument I used was that if you write a general visitor there is no orderly way to propagate checked exceptions out of the visitor traversal — you have to handle those exceptions locally. But that is not actually true: you can just parametrize the visitor with the kind of exception it might throw and then instantiate it with a checked exception if any of the methods might throw one. I’m not sure I would do it that way, but it’s certainly a possibility, at least in the cases where the visitor can only throw one kind of exception. But while this mechanism solves that problem, it creates another.
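Here’s a sketch of what I mean; the types are made up for illustration:

import java.io.IOException;
import java.io.Writer;

// The visitor declares the kind of checked exception its methods may throw
// and the traversal propagates it.
interface IVisitor<T extends Throwable> {
  void visit(Object node) throws T;
}

class Tree {
  private final Object[] nodes = { "a", "b", "c" };
  <T extends Throwable> void traverse(IVisitor<T> visitor) throws T {
    for (Object node : nodes)
      visitor.visit(node);
  }
}

// A visitor that does I/O instantiates T as IOException and the exception
// propagates out of traverse without any local handling:
class WritingVisitor implements IVisitor<IOException> {
  private final Writer out;
  WritingVisitor(Writer out) { this.out = out; }
  public void visit(Object node) throws IOException {
    out.write(node.toString());
  }
}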

Since this mechanism relies on generics it can also be fooled, just as traditional generic classes can. Here’s an example of fooling the static type system:

List<String> strs = ...;
List<Integer> ints = (List) strs; // warning, but legal
Integer i = ints.get(0); // cast error

With generics some operations, such as the conversion from a list of strings to a list of integers, cannot be checked at runtime. Instead, the runtime system has to wait until it has enough information to perform the check. In this case, instead of checking that the list is really a list of integers in line two, it checks that the element extracted in line three is an integer.

The reason why this works is that java’s generics were designed such that even though you might fool it in some cases, you will always be caught before you can do any damage. The type system may not be able to tell a list of strings from a list of integers, but if you try to assign something to an integer-valued variable that the static type system is unsure is an integer, it will insert a runtime check.
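Conceptually, after erasure line three is compiled as if it contained an explicit cast, and that cast is where the check happens:

Integer i = (Integer) ints.get(0); // the inserted check fails here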

For this to work, however, you must eventually be in a situation where you have all the type information available to insert the check. And with exceptions that is no longer the case. Observe:

<T extends Throwable> void cloakException(Throwable e) throws T {
  throw (T) e;
}

void myMethod() {
  try {
    throw new IOException("Harsh!");
  } catch (IOException ioe) {
    this.<RuntimeException>cloakException(ioe);
  }
}

Here cloakException performs an unchecked cast to T and under normal circumstances the caller of cloakException, who has all the type information, would have been able to insert a delayed check that the conversion was really legal. In this case, however, the object is not returned but thrown. That means that the caller never gets a chance to verify the conversion. This essentially provides a way to circumvent the checked exception mechanism and throw checked exceptions freely. If you run the above code, myMethod really throws an IOException without declaring it. How about that.

It’s not something I would recommend as a programming technique, though. The thing is: you can’t really catch them again. If you want to catch a checked exception from a piece of code, the compiler will give an error if it can’t prove that the code might throw that exception. And since the whole point of the hack above is that you can throw checked exceptions without declaring them, by definition, the compiler won’t allow you to catch them again. So it’s really more an odd curiosity than a useful programming technique. But then maybe that is the case for the whole mechanism of throwing type parameters.
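For instance, since myMethod above doesn’t declare IOException, trying to catch the cloaked exception is rejected:

try {
  myMethod();
} catch (IOException ioe) {
  // error: IOException is never thrown in the corresponding try block
}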

Checked exceptions #3

As I’ve mentioned earlier I filed a feature request on sun’s developer network for allowing checked exceptions to be disabled. Unfortunately, they recently decided that this was not a big enough problem that it was worth the effort to solve it. To quote the evaluation:

The use cases against having to declare checked exceptions do not justify a change of this magnitude, at present.

Sadly, the evaluation completely misses the point. I have no problem with declaring exceptions when appropriate — it’s verbose but most everything is in Java and I’ve learned to live with that. Actually, I often declare unchecked exceptions even though I don’t have to because that can be really useful as documentation. What I do have a problem with is that checked exceptions often force you to choose between breaking encapsulation and handling exceptions locally, which is in many cases the wrong place to handle them.

What it comes down to, really, is a matter of choice: some people want to be able to choose for themselves the best place to handle exceptions. Others believe that programmers can’t be trusted to make this decision. Me, I’m pro choice.

APT

As I mentioned in my last post I’ve been playing around with database interfaces. While doing this, I’ve discovered something new: APT, the annotation processing tool for Java. I’ve known for a while that it existed but only now have I discovered what the big deal is. APT. Kicks. Ass!

In the database interface for Java I’m working on, you specify the structure of a database through annotated interfaces, essentially the same way things work in DLINQ. For instance, a database called test containing two tables would be specified like this:

public @Database("test") interface ITestDatabase extends IDatabase {
  public @Table IDataSet<IPerson> persons();
  public @Table IDataSet<IOrder> orders();
}

Using this interface you can open a connection to an actual relational database and fetch the contents either directly (by calling the persons or orders method) or you can specify more complex queries through something more “magical” which I’ll describe later.

The thing is that there’s a lot of ways you can get these interfaces wrong. For instance, you must use interfaces, not classes. Database interfaces must extend IDatabase. The methods that correspond to tables must return an IDataSet of some sort. And so on.

I considered this to be a problem until I discovered the annotation processor framework. An annotation processor is essentially a compiler plugin that is given access to parts of the syntax tree during compilation. That can be used for all sorts of interesting things, including issuing errors and warnings. So I’ve created an annotation processor that you can plug into your compiler and then it will check that all database-related interfaces are well-formed, and it works beautifully.
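For the curious, the skeleton of such a processor, written against the Java 5 com.sun.mirror API, looks roughly like this (a sketch; the actual well-formedness checks are elided):

import java.util.Collection;
import java.util.Collections;
import java.util.Set;

import com.sun.mirror.apt.AnnotationProcessor;
import com.sun.mirror.apt.AnnotationProcessorEnvironment;
import com.sun.mirror.apt.AnnotationProcessorFactory;
import com.sun.mirror.declaration.AnnotationTypeDeclaration;
import com.sun.mirror.declaration.TypeDeclaration;

public class DatabaseCheckerFactory implements AnnotationProcessorFactory {

  public Collection<String> supportedOptions() {
    return Collections.<String>emptySet();
  }

  public Collection<String> supportedAnnotationTypes() {
    return Collections.singleton("Database");
  }

  public AnnotationProcessor getProcessorFor(
      Set<AnnotationTypeDeclaration> types,
      final AnnotationProcessorEnvironment env) {
    return new AnnotationProcessor() {
      public void process() {
        for (TypeDeclaration decl : env.getSpecifiedTypeDeclarations()) {
          // check that decl is a well-formed database interface and report
          // problems through env.getMessager(), as in the example below
        }
      }
    };
  }
}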

To top it all off, the compiler in eclipse now supports APT. If you just configure it correctly, which is no more difficult than using an external library, you get custom checking right in eclipse.

Of course, as with all power tools, you have to show a little restraint and not issue all sorts of lame warnings just because you can…

…even if it’s so easy the code almost writes itself:

for (MethodDeclaration method : decl.getMethods()) {
  if (method.getSimpleName().equals("foo")) {
    env.getMessager().printError(method.getPosition(),
        "Enought with the 'foo' methods already!");
  }
}

I know this tool has been around for a while. Why has it taken me so long to discover how useful it is — has there been some kind of hype that I’ve missed?

And yes, I know I’ve misspelled “enough”.

Databases

Has it really been a month since I last posted? It’s not that I’ve been particularly busy, in fact it’s been quite the opposite: after being unemployed for two months I’ve entered a state of profound laziness. It culminated last week: I did nothing but lie on my couch in an almost liquid state, reading a song of ice and fire and listening to sigur rós. What a week that was.

But I’m slowly regaining my energy and have found a new area of interest: databases. Not databases as such, though, but support for databases in programming languages. I don’t really know anything about databases but happened to read a paper by William Cook and Ali Ibrahim, Integrating Programming Languages & Databases: What’s the Problem?.

Ideally, a programming language interface to a database should provide static type checking while still allowing queries to be parameterized at runtime, and must allow the database to optimize queries. Especially the issue of allowing queries to be optimized makes this problem difficult. Imagine that you’re given a huge database of employees and want to find all the ones called “Bob”. If you don’t care about performance you can just fetch the set of all employees from the database into your program and then select the ones with the right name:

List<IEmployee> bobs = new ArrayList<IEmployee>();
for (IEmployee e : db.getEmployees()) {
  if (e.name().equals("Bob"))
    bobs.add(e);
}

If you do care at all about performance, though, you should ask the database to do all the work and just return the Bobs:

SELECT * FROM employees
WHERE name = 'Bob'

This is where things become difficult: on the one hand you want to write ordinary code in your program that describes the query, like e.name().equals("Bob"), but on the other hand you don’t want that code to be executed; you want it sent to the database as a query.

Microsoft’s LINQ solves this by allowing code to be translated into data in the form of expression trees:

Func<int, bool> filter = (n => n < 2);
Expr<Func<int, bool>> filterCode = filter;
Console.WriteLine("{0}", filterCode.Body.Right);

Here we convert the lambda expression, (n => n < 2), into a piece of data in the second line, and then in the third line extract the Body field, n < 2, and finally the Right hand side of the body, the constant 2. When run, the program prints

ConstantExpression(2)

This solves the problem since you can now take executable code and disassemble it, which enables you to send it to the database to be executed instead of doing it yourself. But this approach also has problems. First of all, it limits what you can write in a lambda expression since it must be possible to convert lambdas into expression trees. Also, it requires changes to the language and it feels like something that has been grafted onto the language without being generally useful, just to make things work in this special case.

So I've been playing around with a solution of my own in Java, one that doesn't require changes to the language. Instead it uses proxy objects to turn code into data. There's still plenty of problems to solve but I think it looks promising. Actually, today I ran my first query built this way, not in a database but on an ordinary collection. Here's the program:

IDataSet<IString> strs = dataSet("Burke", "Connor", "Frank", ...);
Query<IString> query = new Query<IString>() {{
  IString str = from(strs);
  where(str.length().is(5));
  select(str.toUpperCase());
}};
query.printTranscript(System.out);

Notice how the body of the query looks a whole lot like a SQL query, but is written in Java and fully type checked. And the important part is that the query body is not executed as such but turned into data. The last statement prints the disassembled query:

--- query ---
def $x0 = from(DataSet["Burke", "Connor", "Frank", ...])
where($x0.length().is(5))
select($x0.toUpperCase())
-------------

The query is not executed until it is asked to:

IDataSet<IString> result = query.perform();
System.out.println(result);

Only when you call perform is the body of the query executed. In this case the program prints

DataSet[BURKE, FRANK, DAVID]

Since the Query object contains a data representation of the code, it's a small matter to construct a corresponding SQL query to send to the database. As I said before there's still plenty of work left on this but I think solving the problem of converting code into data is a big step. If I make more progress I'll be back with an update. In any case I hope to post more often from now on.
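To give an idea of the underlying trick, here is a minimal sketch, with made-up names and a much cruder transcript than the real thing, of how a dynamic proxy can record method calls instead of executing them:

import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;

public class CallRecorder implements InvocationHandler {

  // A stand-in query interface; only interface return types are handled.
  interface IString {
    IString toUpperCase();
  }

  private final String expr;

  CallRecorder(String expr) {
    this.expr = expr;
  }

  public Object invoke(Object proxy, Method method, Object[] args) {
    String call = expr + "." + method.getName() + "()";
    System.out.println("recorded: " + call);
    Class<?> type = method.getReturnType();
    // Return another recording proxy so calls can be chained.
    return Proxy.newProxyInstance(type.getClassLoader(),
        new Class<?>[] { type }, new CallRecorder(call));
  }

  public static void main(String[] args) {
    IString str = (IString) Proxy.newProxyInstance(
        IString.class.getClassLoader(),
        new Class<?>[] { IString.class }, new CallRecorder("$x0"));
    str.toUpperCase(); // prints "recorded: $x0.toUpperCase()"
  }
}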

Cheated!

Lately I’ve been working on optimizing the parser engine in tedir. It’s been going pretty well — over the last few days I’ve been able to get a 40% performance improvement (discounting I/O overhead) which brings the parse time down to around 1 second per megabyte of input for a pretty messy grammar. It’s still not as fast as the implementation I wrote for my thesis (I got it up to around 4 MB/s) but still pretty good considering that this is a generalized parser, not LL, LALR(1) or whatever.

Most of the work during parsing happens in just a handful of methods in a single class, Runtime. Optimizing tight loops like this is a lot of fun because you can learn a lot about what goes on in the VM by seeing which optimizations work as expected and which ones don’t. Sometimes I’ve made refactorings that I thought should obviously improve performance but didn’t. For instance, inlining methods by hand often doesn’t help because the VM will do that for you. On the other hand, sometimes optimizations do work as expected and that’s what makes it all worthwhile. You probably have to have tried it to know what I’m talking about.

Today, I came across an interesting example of an optimization that I was sure would give a great improvement in performance but which turned out to do exactly the opposite.

Tedir uses BitSets a lot. In particular, the methods that take most of the time during parsing do. If you’ve ever looked at the implementation of BitSet you will have noticed that it does not appear to have been, shall we say, particularly heavily optimized. At least, I got the distinct feeling that this annoying class was wasting my precious processor cycles big time. Naturally, I thought: I’ll write my own BitSet, FastBitSet, that’ll kick BitSet’s ass! So I did: I wrote a very simple bit set without all the bounds checks, basically BitSet with half the code or so discarded. I replaced all uses of BitSets with FastBitSets and ran the benchmarks. The result was shocking: not only was the implementation not faster, it was slower by a factor of two! I tried out various changes to the implementation but nothing helped. I got suspicious.
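For reference, the kind of stripped-down bit set I’m talking about is little more than a bare long array (a sketch, not the actual FastBitSet):

public class FastBitSet {

  private final long[] words;

  public FastBitSet(int bitCount) {
    words = new long[(bitCount + 63) >> 6];
  }

  public void set(int index) {
    words[index >> 6] |= 1L << (index & 63);
  }

  public boolean get(int index) {
    return (words[index >> 6] & (1L << (index & 63))) != 0;
  }
}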

My colleague Kasper suggested that I should go back to java’s BitSet again and see how much time the profiler reported was spent executing methods in BitSet. Surprise: none! According to the profiler, no time at all was spent in methods in BitSet. I can only draw one conclusion from this: the VM is cheating — it must have special handling of the BitSet class. Incidentally, I work with a bunch of people that have previously worked on sun’s java VM but none of them knew anything about this optimization, at least they weren’t the ones who had implemented it. But I think the empirical evidence is pretty strong that the optimization takes place.

It’s hard to complain about an optimization that causes your code to become twice as fast without losing generality but still, I do feel kind of cheated.

Post-script: Until now my goal has been to post here roughly once a week. However, as long as the weather in Denmark is as good as it has been the last few weeks I won’t post as often. Sitting indoors in front of my computer just doesn’t have the same appeal in June as it does in January.

Equals

I recently discovered that if you look closely at the seemingly simple problem of comparing objects for equality it’s actually very hard, if not impossible, to solve satisfactorily. I looked around a bit to see if this was a well-known issue and it turns out that I’m certainly not the first person to notice this problem. See for instance here and here. This post introduces the problem (or one of the problems) of implementing object equality methods.

I have a standard template for implementing the equals method in Java:

public class X ... {

  // ... body of X ...

  public boolean equals(final Object obj) {
    if (this == obj) return true;
    if (!(obj instanceof X)) return false;
    final X that = (X) obj;
    // X-specific comparison
  }

}

Unfortunately it turns out that implementing equals using instanceof doesn’t work in general. Let’s say you have a class LinePosition which specifies a particular line in a text file. Two line positions are equal if they represent the same line:

public boolean equals(final Object obj) {
  if (this == obj) return true;
  if (!(obj instanceof LinePosition)) return false;
  final LinePosition that = (LinePosition) obj;
  return this.getLine() == that.getLine();
}

Now you might later want to create a subclass, CharacterPosition, which not only specifies a line but also a particular character in the line. Two character positions are equal if they specify the same line and the same character in that line:

public boolean equals(final Object obj) {
  if (this == obj) return true;
  if (!(obj instanceof CharacterPosition)) return false;
  final CharacterPosition that = (CharacterPosition) obj;
  return this.getLine() == that.getLine()
      && this.getCharacter() == that.getCharacter();
}

This might look reasonable but it doesn’t work: equals is required to be symmetric but in this case it isn’t:

final LinePosition a = new LinePosition(3);
final CharacterPosition b = new CharacterPosition(3, 5);
final boolean aIsB = a.equals(b); // true
final boolean bIsA = b.equals(a); // false

The problem is that b is more picky about who it considers itself to be equal to than a. Damn.

An alternative approach is to avoid instanceof and use getClass instead:

public boolean equals(final Object obj) {
  if (this == obj) return true;
  if (obj == null || obj.getClass() != this.getClass()) return false;
  final X that = (X) obj;
  // X-specific comparison
}

Using this technique to implement equals on the text position classes would cause a and b to be equally picky about who they considered themselves to be equal with, so equals would be symmetric. Unfortunately, now the objects are so picky that they break Liskov Substitutability, which is a Bad Thing.

Here’s an example where that might hit you. Let’s say that during debugging you notice that instances of LinePosition turn up in places where you didn’t expect them. If I wanted to find out where these positions came from, I would probably make an empty subclass of LinePosition for each place where I instantiate them, so each instantiation site uses its own class. That way, whenever I find a line position where I didn’t expect it, I can see where it was instantiated.

But if equals has been implemented using class equality I can’t do that because even though I don’t change any methods, the behavior of my program has changed in a subtle way. One position can now only be equal to another position if they were created at the same place, which is very different from how the program behaved before. I’ve tried hunting bugs that occurred because a subclass behaved differently from its superclass merely because it had a different class. Bugs like that can be extremely evil because it’s so unexpected if your mind is warped the way object orientation usually warps people’s minds.
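Concretely, assuming LinePosition has a single-int constructor as in the example above:

// One empty subclass per instantiation site; no methods change.
class ParserPosition extends LinePosition {
  ParserPosition(int line) { super(line); }
}

final LinePosition a = new LinePosition(3);
final LinePosition b = new ParserPosition(3);
final boolean same = a.equals(b); // true with instanceof, false with getClass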

At the core, this problem is a question about when you can consider two objects to be of the same kind. Having a common superclass does not imply that two objects are the same kind, as the LinePosition and CharacterPosition example demonstrates. On the other hand, sometimes objects should be considered to be of the same kind even if they have different classes. So how do you determine whether or not two objects should be considered to be the same kind of object? In Java, I don’t know. I think the getClass approach is the greater of two evils so I’ll probably keep doing things the way I’ve always done, using instanceof. Looking outside Java and java-like languages I have an idea for a solution using a mechanism that’s a combination of Scala-style pattern matching and the visitor design pattern. But that will be the subject of another post in the near future.

Hotswap

Gilad writes about JSR292 (Supporting Dynamically Typed Languages on the Java™ Platform) and adding support in the JVM for updating a program’s code without stopping the program. Since hotswapping is part of the general “dynamically typed languages support” JSR, I guess the main focus is on the kind of updates required by a full implementation of languages like Python or Ruby, where a program can change the class of an object, add methods to a class, etc. In other words updates where a program itself changes its own structure.

What is most interesting to me in all this, though, is the kind of hotswapping that allows you to develop applications while they’re running, as demonstrated in this smalltalk screencast. When I spend a lot of time working on a particular corner of our IDE I tend to focus too much on the details and forget the big picture. Watching that video, I was reminded of what kind of system we should, and do, aim to make OSVM into, and that’s probably the kind of interactive programming Gilad has in mind when he talks about how powerful and useful hotswapping is.

In my experience you don’t need full hotswapping for it to be useful. Eclipse, for instance, only supports relatively limited updates — all you can do, more or less, is change the implementation of existing methods. It is still very useful, though, and far preferable to always having to restart everything after every little change. Even if they fall short of a general solution, any steps they can take in the right direction will be great.

Wow, that sounded a lot like me being positive. That’s so unlike me.

Implementing support for this in OSVM was, to say the least, non-trivial. And our VM is not only very simple but was also designed with this kind of use in mind. As for implementing this in the JVM… as they say, good luck with that. But of course this JSR doesn’t have to be ready until Java SE 7 so they have a few years to get it done.