Monthly Archives: January 2006

Interpol

Today, it seems like we resolved the issue that has so far been the most controversial in the design of the new Neptune language: how to build strings. I wouldn’t have expected something like that to be controversial, but it turns out that there are a lot of different ways to do this.

One of the nice things we had in Smalltalk that we’ve lost in the new language is cascaded sends. In Smalltalk, you can use cascaded sends (using ;) as a shorthand when sending multiple messages to the same object. This is really convenient, for instance when creating a new array:

^ (Array new) add: 1; add: 2; add: 3.

Here, I create a new array, send three messages to it which add the elements, and then return it (^ is return). To get the same effect in Java, you’d have to do something like

List<Integer> a = new ArrayList<Integer>();
a.add(1);
a.add(2);
a.add(3);
return a;

which is a lot less concise.

The place where I’ve used cascaded sends most is for printing. If I want to print the value of the variables x and y, I’ll write

System out show: 'x = '; show: x; show: ', y = '; show: y; nl.

Here, I’m sending four show: messages to standard out and then sending a newline. In Java, the code would look something like this:

PrintStream out = System.out;
out.print("x = ");
out.print(x);
out.print(", y = ");
out.print(y);
out.println();

Of course, in Java you’d never do that; instead you’d create a single string and just do one print:

System.out.println("x = " + x + ", y = " + y);

A different situation is when objects are asked to return a string representation of themselves. In Java, you’d do pretty much the same thing as when printing: construct a single string and then return it instead of printing it:

public String toString() {
  return "a Point ( x = " + this.x + ", y = " + this.y + " )";
}

In our Smalltalk system, there was no toString method on objects. Instead, there was a printOn: method, so rather than returning a string you would print the object on a stream, using cascaded sends again:

printOn: aStream
aStream show: 'a Point ( x = '; show: x; show: ', y = '; show: y; show: ' )'.

It was decided pretty early on that in the new language we’d go for a model similar to Java’s (and most other languages’) rather than the printing approach used in Smalltalk.

The printing approach has the advantage that it can be more efficient, since no strings are actually constructed, but on the other hand you rarely have time-critical applications that do a lot of object-printing. And our expectation is that a toString-like mechanism will be easier for programmers to work with. Finally, having a printOn: method on Object would mean that we would have to have some form of streaming support in the core library, and we’re trying to have as little code as possible in the core.

Having decided this, the question arises: what mechanism will we use to construct strings? In most languages I’ve worked with you can concatenate strings using an infix operator, often +. This works nicely if you’re concatenating strings, but often you’re creating strings from objects that aren’t strings themselves; in the point example above, x and y could be any kind of objects. So either you have to manually convert the objects that aren’t strings into strings, like this…

public String toString() {
  return "a Point ( x = " + this.x.toString() + ", y = " + this.y.toString() + " )";
}

…or define the string + operation so that it automatically converts its arguments into strings. Both solutions sort of suck, and neither of them solves the problem that concatenating strings is expensive. The Java spec says that the compiler is free to optimize string concatenation using StringBuffers, but that relies on static type information (which we don’t have) and is just not a clean solution since it doesn’t work in all cases.

One of the very first methods I wrote in the new system was the infix ++ operator on objects. a ++ b returns an object whose string representation is a’s followed by b’s. That way, the as_string method (which is Neptune for toString) could be written as

String as_string() {
  return ("a Point ( x = " ++ x ++ ", y = " ++ y ++ " )").as_string();
}

The result of ++ works sort of like a StringBuffer in that it is not a string but is used to construct strings. The advantage over the Java model is that we don’t need any special rules about how string concatenation works, and that the as_string method on the result of a ++ can be implemented so that constructing the result is as efficient as using a StringBuffer in Java.
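To make the StringBuffer comparison a bit more concrete, here is a minimal sketch in C of how a deferred concatenation could work. This is not how ++ is actually implemented in Neptune: the struct concat type and the concat_* functions are names invented for this example, and the leaves are plain C strings where Neptune would have arbitrary objects that get converted to strings. The only point is that a ++ b can be a small node that remembers its two halves, and that the final as_string step can measure everything first and then build the result with one allocation and one copy pass.

```c
#include <stdlib.h>
#include <string.h>

/* A node is either a leaf holding text or a pair standing for left ++ right.
   (Hypothetical type, invented for this sketch.) */
struct concat {
    const char *text;            /* non-NULL for a leaf */
    const struct concat *left;   /* used only when text is NULL */
    const struct concat *right;  /* used only when text is NULL */
};

/* Total number of characters the node will produce. */
static size_t concat_length(const struct concat *node) {
    if (node->text != NULL)
        return strlen(node->text);
    return concat_length(node->left) + concat_length(node->right);
}

/* Copy the node's characters to dst and return the position just after them. */
static char *concat_write(const struct concat *node, char *dst) {
    if (node->text != NULL) {
        size_t n = strlen(node->text);
        memcpy(dst, node->text, n);
        return dst + n;
    }
    dst = concat_write(node->left, dst);
    return concat_write(node->right, dst);
}

/* The as_string step: one allocation, one copy pass.
   Returns NULL if allocation fails; the caller frees the result. */
char *concat_as_string(const struct concat *node) {
    size_t n = concat_length(node);
    char *out = malloc(n + 1);
    if (out == NULL)
        return NULL;
    *concat_write(node, out) = '\0';
    return out;
}
```

Building ("x = " ++ "42") ++ ", y" then means creating two pair nodes over three leaves, and calling concat_as_string on the outer node gives back the flat string. No intermediate strings are created along the way.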

We used that for a bit but later we decided that we wanted to use ++ for incrementing variables like in C. And that’s where all the trouble began. The first thing we tried was to keep the operation but just use a different operator. A lot of different operators were suggested (I liked +: or &:) but no one really liked any of them. We discussed it a few times over the last month or so but nothing came of it.

After using it for a while, another problem with ++ turned out to be that it is a bug generator: using "foo" ++ "bar" as a string will cause an error, the same way that it is illegal to use a StringBuffer in place of a string in Java; you have to use ("foo" ++ "bar").as_string(). But Java has the advantage of a type system so you’ll see the problem at compile time, whereas we don’t see the problem until runtime. So we discussed various completely different approaches — varargs, using the same model as Java, etc. When the dust settled there hadn’t been any ideas that everyone (or anyone for that matter) really liked.

But when people came in this morning, two of us had the same idea: to use string interpolation. String interpolation allows you to write general expressions within string constants, that are evaluated and “spliced” into the string:

String as_string() {
  return "a Point ( x = ${x}, y = ${y} )";
}

In the above expression, when the string is evaluated, anything between ${ and } is evaluated and spliced into the string, so ${x} is replaced with the value of x and ${y} is replaced with the value of y. Perl, Ruby and Python all have some form of this, and my attitude has been that it was a neat but sort of cheesy construct that didn’t belong in a “serious” language. But no one had strong feelings against it; we just couldn’t agree on the exact syntax:

"a Point ( x = ${x}, y = ${y} )"
"a Point ( x = {x}, y = {y} )"
"a Point ( x = %{x}, y = %{y} )"
"a Point ( x = $x, y = $y )"
"a Point ( x = `x`, y = `y` )"
`a Point ( x = ${x}, y = ${y} )`

…etc. etc. Currently it seems like we’re going with ${...}, which is my favorite. And then we’ll reserve $ so if you want a dollar sign in a string you’ll have to write \$, which allows us to later introduce shorthands like $x (or $x.foo or $x[0] or…). Even if I think it’s kind of cheesy I think it will be a hit, especially with C programmers who are used to printf-style string formatting.
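For the curious, here is a rough sketch in C of what the splicing amounts to. It is a deliberately simplified stand-in and not the Neptune implementation: it only handles plain names inside ${...}, where the real thing would evaluate general expressions, and it resolves them through a lookup callback at runtime. The names interpolate, lookup_fn and point_lookup are all made up for the example, and the point's values (3 and 4) are just demo data. It does follow the two rules described above: \$ produces a literal dollar sign, and a bare $ is rejected because it is reserved.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical callback: returns the text for a name, or NULL if unknown. */
typedef const char *(*lookup_fn)(const char *name, size_t name_len);

/* Append n bytes to out if they fit with room left for '\0'; 0 on overflow. */
static int append(char *out, size_t cap, size_t *len, const char *src, size_t n) {
    if (n >= cap - *len)
        return 0;
    memcpy(out + *len, src, n);
    *len += n;
    return 1;
}

/* Expand ${name} in tmpl into out; "\$" yields a literal dollar sign.
   Returns 1 on success, 0 on overflow, unknown name, or malformed input. */
int interpolate(const char *tmpl, lookup_fn lookup, char *out, size_t cap) {
    size_t len = 0;
    if (cap == 0)
        return 0;
    while (*tmpl != '\0') {
        if (tmpl[0] == '\\' && tmpl[1] == '$') {
            if (!append(out, cap, &len, "$", 1))
                return 0;
            tmpl += 2;
        } else if (tmpl[0] == '$') {
            const char *end;
            const char *value;
            if (tmpl[1] != '{')
                return 0;                      /* bare $ is reserved */
            end = strchr(tmpl + 2, '}');
            if (end == NULL)
                return 0;                      /* missing closing brace */
            value = lookup(tmpl + 2, (size_t)(end - (tmpl + 2)));
            if (value == NULL || !append(out, cap, &len, value, strlen(value)))
                return 0;
            tmpl = end + 1;
        } else {
            if (!append(out, cap, &len, tmpl, 1))
                return 0;
            tmpl += 1;
        }
    }
    out[len] = '\0';
    return 1;
}

/* Demo lookup for a point with x = 3 and y = 4 (values made up for the example). */
const char *point_lookup(const char *name, size_t name_len) {
    if (name_len == 1 && name[0] == 'x')
        return "3";
    if (name_len == 1 && name[0] == 'y')
        return "4";
    return NULL;
}
```

Calling interpolate with the template "a Point ( x = ${x}, y = ${y} )" and point_lookup fills the buffer with the point's description in one pass, which is essentially what the as_string method above asks for.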

Checked exceptions #2

In a previous post, I mentioned that I’ve filed a feature request on Sun’s developer network for an option to turn off those [expletive deleted] checked exceptions in Java. Well, I had a chance to ask Peter Ahé (lead developer on Sun’s Java compiler) about it and he didn’t think very much of the idea. Darn. I guess a solution to this isn’t around the corner.

Update: The feature request has been accepted as RFE 6376696.

iText

At work, I’m currently working on a documentation generator similar to JavaDoc or Doxygen for our new system. I browsed around a bit for a Java library for constructing PDF documents and found iText, which is one of the best-structured and best-documented Java libraries I’ve ever come across. Their iText by Example tutorial has so many examples that anything you’ll want to do when you first try iText will be covered. And by the time you want to do something advanced that is not described in the tutorial, you’ll know enough about how things work that you’re likely to be able to figure it out yourself. Brilliant!

Structs and Memory

Last night I started working on a program for making the interface between Neptune and C code easier to work with. Here, I’ll explain what makes the interface difficult in some cases and how this new tool will make it easier. Beware: long post!

The OSVM system has a really simple interface for calling external C code. You can define a Neptune method as external by using the extern keyword:

class X {
  extern int my_external_method(int a, int b);
}

When you invoke that method, the C function named my_external_method in the underlying system is located and called. Similarly, if I want to call the standard time function in C I just make an external method called time in some Neptune object and when I invoke that method, the call will go through to the time function in C because they have the same name. You can only pass integers as arguments, and only integers can be returned from the function. Nice and simple.

Well, except that sometimes you really want to call a C function with a piece of structured data, or have it returned. For instance, you might want to use a graphics library written in C, and in that case it would be nice to be able to pass in a text string when calling the draw_string function:

void draw_string(char *str, int x, int y);

How do you get from a fancy Neptune string with bells and whistles to a character array, and how do you pass it to the function? The solution is to use Memory objects. A Memory object allows you to allocate and deallocate memory in the underlying system. It is essentially the same as malloc and free except that a memory object checks that you don’t access memory outside of the allocated area. So you can create a character array by allocating a piece of memory of the right size, copying the characters from the Neptune string into the memory area, and then finally pass the address of the memory to the external call:

String str = "Whatever...";
Memory mem = new Memory(str.size + 1); // Allocate memory
for (int i = 0; i < str.size; i++)
  mem.set_byte(i, str[i].as_integer()); // Copy characters
mem.set_byte(str.size, 0); // Null-terminate
draw_string(mem.address, 0, 0); // Perform call
mem.free(); // Free memory

We’re still just passing integers through the C call, but one of them is a pointer to the character array which contains our string. Unfortunately, since the memory is allocated in the underlying system, it is outside the reach of the garbage collector, so you have to do manual memory management. However, in some cases, for instance with externalizing strings, there are convenience methods that free you from dealing directly with memory objects:

str.externalize_during(fun (Memory c_string) {
  draw_string(c_string.address, 0, 0);
});

In this case, string has a method which externalizes the string, invokes the given block, and then cleans up.
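If it helps to picture what a Memory object does, here is a small sketch of the same idea written directly in C. It is only an illustration under my own assumptions, not the OSVM implementation: struct memory and the memory_* functions are invented names. It shows the two properties described above, namely that every access is checked against the size of the region, and that the region has to be freed by hand.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical C-side picture of a Memory object: a base address plus a size. */
struct memory {
    unsigned char *base;
    size_t size;
};

/* Allocate a fresh, zero-filled region; returns 0 if allocation fails. */
int memory_alloc(struct memory *mem, size_t size) {
    mem->base = calloc(size, 1);
    mem->size = (mem->base != NULL) ? size : 0;
    return mem->base != NULL;
}

/* Store one byte; refuses (returns 0) when index is outside the region. */
int memory_set_byte(struct memory *mem, size_t index, unsigned char value) {
    if (index >= mem->size)
        return 0;
    mem->base[index] = value;
    return 1;
}

/* Copy a string and its terminating 0 into a new region,
   mirroring the loop in the Neptune example above. */
int memory_from_string(struct memory *mem, const char *str) {
    size_t n = strlen(str);
    size_t i;
    if (!memory_alloc(mem, n + 1))
        return 0;
    for (i = 0; i < n; i++)
        memory_set_byte(mem, i, (unsigned char)str[i]);
    return memory_set_byte(mem, n, 0);
}

/* Release the region; nothing else will do it for you. */
void memory_free(struct memory *mem) {
    free(mem->base);
    mem->base = NULL;
    mem->size = 0;
}
```

A helper like externalize_during is then just memory_from_string, the call, and memory_free wrapped up so the caller can’t forget the last step.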

A character array is a pretty simple thing and the approach above solves some but not all problems with external calls. For instance, say you want to use an external C library that can use GPS to give the current position:

struct point {
  int x, y;
};

struct point *get_position();

When you call this external function from Neptune you will get the address of a C point structure. Now, you can access the contents of this piece of memory by using a memory object. Memory objects can be used in two ways: either to allocate a fresh piece of memory or to give access to a piece of memory that has already been allocated:

int address = get_position();
Memory mem = new Memory(address, 8); // using existing memory
int x = mem.get_int(0);
int y = mem.get_int(4);

Here we have to make some assumptions about how structs are implemented in C. We guess that the point takes up 8 bytes of memory and that x and y are integers starting at byte offset 0 and 4 respectively. It is probably true, but it really is just a guess. The C standard does say something about the implementation of structs, but it doesn’t define them completely. For instance, it says that x must occur before y. However, there are holes in the standard which allow different compilers to implement structs differently. Some standard structures, for instance in network code (tcp.h), use a lot of bit fields, which allow you to access individual bits or groups of bits within a struct. The C standard says that a C compiler is free to decide how to implement bit fields. So if an external call returns a TCP header I may have access to the contents of the struct, but if I want to know the value of the SYN field I have no way to know where to find it. Bugger.
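The guess about the point struct can at least be checked from the C side. The following sketch separates what I believe the standard promises (the first member sits at offset 0 and members appear in declaration order) from what the Neptune code above assumes (8 bytes in total, y at offset 4). The two function names are invented for this example.

```c
#include <stddef.h>

struct point {
    int x, y;
};

/* Returns 1 if this compiler lays out struct point the way the
   Neptune code guessed: 8 bytes in total, x at offset 0, y at offset 4. */
int point_layout_matches_guess(void) {
    return sizeof(struct point) == 8
        && offsetof(struct point, x) == 0
        && offsetof(struct point, y) == 4;
}

/* Returns 1 if the layout has the properties the standard requires:
   the first member is at offset 0 and y comes after all of x. */
int point_layout_meets_standard(void) {
    return offsetof(struct point, x) == 0
        && offsetof(struct point, y) >= sizeof(int);
}
```

On a typical compiler with 4-byte ints both functions return 1, but only the second one has to. There is no equivalent check for bit fields, since offsetof cannot be applied to them.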

Maybe I’ll experiment with my compiler and figure out that it always puts SYN in word 4 as bit 21. That’s not enjoyable work, and a TCP header has more than 15 fields so that produces a lot of nasty code full of magic numbers. Worse, the code is now tied to a particular compiler, in fact a particular version of a particular compiler since they are free to change the implementation of structs in a later version. Bleh!

As I said in the beginning, I’ve started working on a tool for making interaction easier between Neptune and C code. In fact, the tool is exactly designed to, if not solve the problem with structs, then at least make it a lot easier to deal with. The tool processes the definition of a struct and then generates a Neptune class which wraps a memory object and provides accessors for the struct’s fields. For instance, for a simple example such as the point struct from before:

class Point {

  Memory memory;

  Point(int address) {
    memory = new Memory(address, 8);
  }

  int accessor x {
    return memory.get_int(0);
  }

  int accessor x=(int new_x) {
    memory.set_int(0, new_x);
  }

  ...

}

So now you can access a point struct as if it were a real object

Point p = new Point(get_position());
int x = p.x;
p.y = 8;

without dealing with offsets and other nastiness. Of course, that doesn’t actually solve the bitfield problem, I just wanted to show how the interface worked.

The tool deals with the bitfield problem by simply asking the compiler how the layout of a structure is. When generating the Neptune class for a structure, you must also specify which compiler you’re using. Through various trickery, the tool gets the compiler to describe the layout of the structure and uses that information to generate the class correctly. That way the tool takes care of determining, for instance, where the SYN bit is in a TCP header, and ensures that the code for extracting and setting that bit is correct. If the compiler changes the way it lays out structures the generated code will break, but you can just run the tool again and get new working classes. It doesn’t solve all problems but it does in fact solve the problems that can be solved. And besides figuring out the layout of bit fields, it generates code that makes for really easy access to the fields, for instance setters so you can do

TCPHeader hdr = new TCPHeader();
hdr.SYN = true;
hdr.ACK = false;

The SYN=(bool v) accessor handles all the bit fiddling and masking which causes the bits to be set correctly in the underlying C structure. This means that it will behave correctly if you pass the structure to an external call (using hdr.address):

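// Neptune side: the struct's address is passed as an integer
extern void process_header(int header);
...
process_header(hdr.address);

/* C side: the same address arrives as a pointer to the struct */
void process_header(struct tcphdr *hdr) {
  if (hdr->SYN) {
    ...
  }
}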
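The bit fiddling inside such an accessor boils down to something like the following C sketch. The function names and the bit numbering (bit 0 is the least significant bit of byte 0, and a byte is assumed to have 8 bits) are choices made for this example; the generated Neptune code would use whatever position the compiler reported for the field.

```c
#include <stddef.h>

/* Set or clear one bit in a raw memory region.
   bit_index counts from bit 0 of byte 0, least significant bit first
   (a numbering chosen for this sketch; assumes 8-bit bytes). */
void set_bit(unsigned char *mem, size_t bit_index, int value) {
    unsigned char mask = (unsigned char)(1u << (bit_index % 8));
    if (value)
        mem[bit_index / 8] |= mask;
    else
        mem[bit_index / 8] &= (unsigned char)~mask;
}

/* Read the same bit back as 0 or 1. */
int get_bit(const unsigned char *mem, size_t bit_index) {
    return (mem[bit_index / 8] >> (bit_index % 8)) & 1;
}
```

With this numbering, set_bit(mem, 21, 1) turns on bit 5 of the third byte and leaves every other bit alone, which is the property that matters when the struct is later handed to C code.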

How does the tool figure out what the layout of a structure is? It generates a C program that uses the struct, compiles it with the specified compiler, and then runs the program which prints out a description of the structures. It then uses that description when generating the Neptune classes. I can’t take credit for that idea; Kasper came up with it. For instance, if you want to know the offset of point.y you can use the offsetof macro which is defined in stddef.h. For each member, the generated C program contains a line like this one

printf("%s %s %zu\n", "point", "y", offsetof(struct point, y));

Lo and behold, when I run the program it prints out

point y 4
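A slightly fuller sketch of what such a generated program could contain is shown here. The DESCRIBE_MEMBER macro and describe_point function are names I made up for the illustration, and the real tool’s output format may well differ; the point is just that one line per member, plus the total size, is enough to regenerate the accessors.

```c
#include <stddef.h>
#include <stdio.h>

struct point {
    int x, y;
};

/* Format "<struct> <member> <offset>" into buf (hypothetical helper macro). */
#define DESCRIBE_MEMBER(buf, cap, type, member) \
    snprintf((buf), (cap), "%s %s %zu", #type, #member, \
             offsetof(struct type, member))

/* Print one description line per member of struct point, then its size. */
void describe_point(void) {
    char line[64];
    DESCRIBE_MEMBER(line, sizeof line, point, x);
    puts(line);
    DESCRIBE_MEMBER(line, sizeof line, point, y);
    puts(line);
    printf("point sizeof %zu\n", sizeof(struct point));
}
```

Calling describe_point from main gives one line for x, one for y, and one for the size of the whole struct, with the numbers supplied by whichever compiler built the program.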

One problem with this is that it doesn’t work with bit fields — I can’t do offsetof(struct tcphdr, SYN) because you can’t take the address of a bit field and that’s how offsetof is implemented. Instead, I use a little trick which I can take credit for: I simply fill the struct with zeroes, then set the field to the largest possible value it can contain, and then scan through the struct bit-by-bit to find the area where bits have been set.

struct tcphdr my_hdr;
memset(&my_hdr, 0, sizeof(my_hdr));
my_hdr.SYN = 1;
find_set_bits(&my_hdr);

Cheesy eh…
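Here is roughly what the scanning step can look like. This is a sketch and not the tool’s actual code: the struct flags type is a made-up stand-in for a header with bit fields, my find_set_bits takes an extra size argument, and the result type struct bit_range is invented. It also assumes, as the trick itself does, that the bits outside the field stay zero after the assignment, and it numbers bits least significant first within each byte, so on some platforms a field that spans bytes may not show up as one contiguous run.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for a header with bit fields; the field names are made up. */
struct flags {
    unsigned int offset : 4;
    unsigned int syn : 1;
    unsigned int window : 11;
};

/* Result of a scan: index of the first set bit and how many bits are set. */
struct bit_range {
    size_t first;
    size_t count;
};

/* Scan an object byte by byte, least significant bit first.
   If no bit is set, count is 0 and first is meaningless. */
struct bit_range find_set_bits(const void *object, size_t size) {
    const unsigned char *bytes = object;
    struct bit_range range = { 0, 0 };
    size_t i;
    for (i = 0; i < size * CHAR_BIT; i++) {
        if ((bytes[i / CHAR_BIT] >> (i % CHAR_BIT)) & 1u) {
            if (range.count == 0)
                range.first = i;
            range.count++;
        }
    }
    return range;
}

/* Locate the window field: zero the struct, fill the field, scan. */
struct bit_range locate_window(void) {
    struct flags f;
    unsigned int all_ones = ~0u;  /* reduced to the field's width on assignment */
    memset(&f, 0, sizeof f);
    f.window = all_ones;
    return find_set_bits(&f, sizeof f);
}

/* Same procedure for the one-bit syn field, whose largest value is 1. */
struct bit_range locate_syn(void) {
    struct flags f;
    memset(&f, 0, sizeof f);
    f.syn = 1;
    return find_set_bits(&f, sizeof f);
}
```

The count tells you the width of the field and first tells you where this particular compiler put it, which is the information that goes into the generated accessor.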

The tool is not really done but neither is the new OSVM containing Neptune. Both will be freely available later this spring…

Feynman Lectures

I was browsing around for interesting podcasts when I stumbled upon some videos of four lectures on physics given by Richard Feynman in 1979. Brilliant stuff!