Interpol

Today, it seems like we resolved the issue that has so far been the most controversial in the design of the new neptune language: how to build strings. I wouldn’t have expected that something like that should be controversial, but it turns out that there are a lot of different ways to do this.

One of the nice things we had in Smalltalk that we’ve lost in the new language is cascaded sends. In Smalltalk, you can use cascaded sends (using ;) as a shorthand when sending multiple messages to the same object. This is really convenient, for instance when creating a new array:

^ (Array new) add: 1; add: 2; add: 3.

Here, I create a new array, send three messages to it which add the elements, and then return it (^ is return). To get the same effect in Java, you’d have to do something like

List<Integer> a = new ArrayList<>();
a.add(1);
a.add(2);
a.add(3);
return a;

which is a lot less concise.
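The closest Java gets to a cascade is a fluent interface, where each method returns the receiver so calls can be chained. Here is a minimal sketch; ListBuilder is a made-up helper for illustration, not something from the JDK or from our system:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: wraps a list so that add returns the wrapper itself,
// letting calls chain a bit like a Smalltalk cascade.
final class ListBuilder<T> {
    private final List<T> items = new ArrayList<>();

    ListBuilder<T> add(T item) {
        items.add(item);
        return this; // returning the receiver is what enables chaining
    }

    List<T> build() {
        return items;
    }
}

class CascadeDemo {
    static List<Integer> oneTwoThree() {
        return new ListBuilder<Integer>().add(1).add(2).add(3).build();
    }

    public static void main(String[] args) {
        System.out.println(oneTwoThree()); // prints [1, 2, 3]
    }
}
```

The difference is that a cascade works with any message on any object, whereas the fluent style only works if the class was written to return itself.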

The place where I’ve used cascaded sends most is for printing. If I want to print the value of the variables x and y, I’ll write

System out show: 'x = '; show: x; show: ', y = '; show: y; nl.

Here, I’m sending four show: messages to standard out and then sending a newline. In Java, the code would look something like this:

PrintStream out = System.out;
out.print("x = ");
out.print(x);
out.print(", y = ");
out.print(y);
out.println();

Of course, in Java you’d never do that; instead you’d create a single string and just do one print:

System.out.println("x = " + x + ", y = " + y);

A different situation is when objects are asked to return a string representation of themselves. In Java, you’d do pretty much the same thing as when printing: construct a single string and then return it instead of printing it:

public String toString() {
    return "a Point ( x = " + this.x + ", y = " + this.y + " )";
}

In our Smalltalk system, there was no toString method on objects. Instead, there was a printOn: method, so rather than returning a string you would print the object on a stream, again using cascaded sends:

printOn: aStream
    aStream show: 'a Point ( x = '; show: x; show: ', y = '; show: y; show: ' )'.

It was decided pretty early on that in the new language we’d go for a model similar to Java’s (and most other languages’) rather than the printing approach used in Smalltalk.

The printing approach has the advantage that it can be more efficient, since no strings are actually constructed, but on the other hand you rarely have time-critical applications that do a lot of object-printing. And our expectation is that a toString-like mechanism will be easier for programmers to work with. Finally, having a printOn: method on Object would mean that we would have to have some form of streaming support in the core library, and we’re trying to have as little code as possible in the core.
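To make the trade-off concrete, here is a rough Java rendering of the two models side by side. The Point class and its printOn method are made up for illustration; Appendable is the standard Java interface implemented by both PrintStream and StringBuilder:

```java
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical Point class, for illustration only.
final class Point {
    private final int x;
    private final int y;

    Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    // Printing approach: write the pieces straight to the stream, so no
    // string for the whole point is ever built (the numbers are still
    // converted to small strings on the way).
    void printOn(Appendable out) {
        try {
            out.append("a Point ( x = ").append(String.valueOf(x))
               .append(", y = ").append(String.valueOf(y)).append(" )");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // toString approach: build one string and hand it back.
    @Override
    public String toString() {
        return "a Point ( x = " + x + ", y = " + y + " )";
    }
}

class PrintOnDemo {
    public static void main(String[] args) {
        Point p = new Point(1, 2);
        p.printOn(System.out);  // pieces go directly to standard out
        System.out.println();
        System.out.println(p);  // one string is built, then printed
    }
}
```

Note that the printing version needs a stream abstraction to exist before any object can describe itself, which is exactly the dependency we want to keep out of the core.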

Having decided this, the question arises: what mechanism will we use to construct strings? In most languages I’ve worked with you can concatenate strings using an infix operator, often +. This works nicely if you’re concatenating strings, but often you’re creating strings from objects that aren’t strings themselves; in the point example above, x and y could be any kind of object. So either you have to manually convert the objects that aren’t strings into strings, like this…

public String toString() {
    return "a Point ( x = " + this.x.toString() + ", y = " + this.y.toString() + " )";
}

…or define the string + operation so that it automatically converts its arguments into strings. Both solutions sort of suck, and neither of them solves the problem that concatenating strings is expensive. The Java spec says that the compiler is free to optimize string concatenation using StringBuffers, but that relies on static type information (which we don’t have) and is just not a clean solution since it doesn’t work in all cases.
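As a rough picture of what that optimization amounts to, the sketch below puts a + expression next to a hand-written buffer version. The class and method names are made up, and I use StringBuilder, the unsynchronized sibling of StringBuffer; what a particular compiler actually generates may differ:

```java
// Hypothetical class, for illustration only.
final class ConcatExpansion {
    // What the programmer writes.
    static String withPlus(Object x, Object y) {
        return "a Point ( x = " + x + ", y = " + y + " )";
    }

    // Roughly what a compiler may turn it into: one buffer, several appends,
    // and a single conversion to a string at the end.
    static String withBuilder(Object x, Object y) {
        return new StringBuilder()
            .append("a Point ( x = ").append(x)
            .append(", y = ").append(y)
            .append(" )")
            .toString();
    }

    public static void main(String[] args) {
        System.out.println(withPlus(1, 2).equals(withBuilder(1, 2))); // prints true
    }
}
```

The rewrite is only possible because the compiler knows statically that the + in question is string concatenation, which is the information we don’t have.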

One of the very first methods I wrote in the new system was the infix ++ operator on objects. a ++ b returns an object whose string representation is a’s followed by b’s. That way, the as_string method (which is neptune for toString) could be written as

String as_string() {
    return ("a Point ( x = " ++ x ++ ", y = " ++ y ++ " )").as_string();
}

The result of ++ works sort of like a StringBuffer in that it is not a string but is used to construct strings. The advantage over the Java model is that we don’t need any special rules about how string concatenation works, and that the as_string method on the result of a ++ can be implemented so that constructing the result is as efficient as using a StringBuffer in Java.
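For the curious, here is one way such an object could work, sketched in Java since neptune isn’t runnable here. Concat, concat and asString are names invented for this sketch and are not how the real implementation is written: each ++ just records its two operands, and the work of building a string is deferred until as_string is called.

```java
// Hypothetical Java stand-in for the result of the ++ operator.
final class Concat {
    private final Object left;
    private final Object right;

    Concat(Object left, Object right) {
        this.left = left;
        this.right = right;
    }

    // Chaining, as in (a ++ b) ++ c: no characters are copied here.
    Concat concat(Object next) {
        return new Concat(this, next);
    }

    // Walk the tree of operands and append every leaf to one shared buffer.
    private void appendTo(StringBuilder out) {
        appendPart(left, out);
        appendPart(right, out);
    }

    private static void appendPart(Object part, StringBuilder out) {
        if (part instanceof Concat) {
            ((Concat) part).appendTo(out);
        } else {
            out.append(part); // falls back to the operand's own string form
        }
    }

    // The only place a string is actually constructed.
    String asString() {
        StringBuilder out = new StringBuilder();
        appendTo(out);
        return out.toString();
    }
}

class ConcatDemo {
    public static void main(String[] args) {
        Concat c = new Concat("a Point ( x = ", 1).concat(", y = ").concat(2).concat(" )");
        System.out.println(c.asString()); // prints a Point ( x = 1, y = 2 )
    }
}
```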

We used that for a bit but later we decided that we wanted to use ++ for incrementing variables like in C. And that’s where all the trouble began. The first thing we tried was to keep the operation but just use a different operator. A lot of different operators were suggested (I liked +: or &:) but no one really liked any of them. We discussed it a few times over the last month or so but nothing came of it.

After using it for a while, another problem with ++ turned out to be that it is a bug generator: using "foo" ++ "bar" as a string will cause an error, the same way that it is illegal to use a StringBuffer in place of a string in Java; you have to use ("foo" ++ "bar").as_string(). But Java has the advantage of a type system so you’ll see the problem at compile time, whereas we don’t see the problem until runtime. So we discussed various completely different approaches — varargs, using the same model as Java, etc. When the dust settled there hadn’t been any ideas that everyone (or anyone for that matter) really liked.
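To show what that failure mode looks like, the sketch below imitates a language without static types by holding everything in Object-typed variables, so Java’s compiler can no longer help. The class and method names are made up, and a StringBuilder stands in for the unfinished result of ++:

```java
class BuilderMisuseDemo {
    // Hypothetical method that insists on a real String.
    static int lengthOf(Object value) {
        String s = (String) value; // fails at runtime if value is not a String
        return s.length();
    }

    // Reports whether passing the value to lengthOf blows up at runtime.
    static boolean failsAtRuntime(Object value) {
        try {
            lengthOf(value);
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Stand-in for "foo" ++ "bar": a builder, not yet a string.
        Object unfinished = new StringBuilder("foo").append("bar");
        System.out.println(failsAtRuntime(unfinished));            // prints true
        System.out.println(failsAtRuntime(unfinished.toString())); // prints false
    }
}
```

Nothing about the first call looks wrong when you read it, which is what makes the missing conversion so easy to forget.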

But when people came in this morning, two of us had the same idea: to use string interpolation. String interpolation allows you to write general expressions within string constants, that are evaluated and “spliced” into the string:

String as_string() {
    return "a Point ( x = ${x}, y = ${y} )";
}

In the above expression, when the string is evaluated, anything between ${ and } is evaluated and spliced into the string, so ${x} is replaced with the value of x and ${y} is replaced with the value of y. Perl, Ruby and Python all have some form of this, and my attitude has been that it was a neat but sort of cheesy construct that didn’t belong in a “serious” language. But no one had strong feelings against it, except that we couldn’t agree on the exact syntax:

"a Point ( x = ${x}, y = ${y} )"
"a Point ( x = {x}, y = {y} )"
"a Point ( x = %{x}, y = %{y} )"
"a Point ( x = $x, y = $y )"
"a Point ( x = `x`, y = `y` )"
`a Point ( x = ${x}, y = ${y} )`

…etc. etc. Currently it seems like we’re going with ${...}, which is my favorite. And then we’ll reserve $ so if you want a dollar sign in a string you’ll have to write \$, which allows us to later introduce shorthands like $x (or $x.foo or $x[0] or…). Even if I think it’s kind of cheesy I think it will be a hit, especially with C programmers who are used to printf-style string formatting.
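Here is a small Java sketch of the rules as they currently stand: ${...} is spliced, \$ gives a literal dollar sign, and a bare $ is rejected because it is reserved. It is deliberately simplified. The Interpolator class is made up, and it only looks names up in a map, whereas the real thing would evaluate arbitrary expressions in the enclosing scope:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical interpolator, for illustration only.
final class Interpolator {
    static String interpolate(String template, Map<String, ?> values) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < template.length()) {
            char c = template.charAt(i);
            boolean hasNext = i + 1 < template.length();
            if (c == '\\' && hasNext && template.charAt(i + 1) == '$') {
                out.append('$'); // \$ is an escaped, literal dollar sign
                i += 2;
            } else if (c == '$') {
                if (!hasNext || template.charAt(i + 1) != '{') {
                    // A bare $ is reserved for future shorthands like $x.
                    throw new IllegalArgumentException("bare $ at index " + i);
                }
                int close = template.indexOf('}', i + 2);
                if (close < 0) {
                    throw new IllegalArgumentException("unterminated ${ at index " + i);
                }
                String name = template.substring(i + 2, close).trim();
                if (!values.containsKey(name)) {
                    throw new IllegalArgumentException("unknown name: " + name);
                }
                out.append(values.get(name)); // splice in the value's string form
                i = close + 1;
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }
}

class InterpolatorDemo {
    public static void main(String[] args) {
        Map<String, Object> values = new HashMap<>();
        values.put("x", 1);
        values.put("y", 2);
        // In Java source, "\\$" is the two characters \$ in the template.
        System.out.println(Interpolator.interpolate(
            "a Point ( x = ${x}, y = ${y} ), price \\$5", values));
        // prints a Point ( x = 1, y = 2 ), price $5
    }
}
```

Since the pieces go into a single buffer, this costs about the same as the StringBuffer approach, and the result is already a string, so there is no as_string call to forget.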
