Monthly Archives: June 2006

Types

The last big thing in neptune that I haven’t written about is the static type system. We’re not actually 100% done with it yet; the design is mostly settled but there are still a few late additions that we seem to agree are good ideas but haven’t implemented yet. I think we’re far enough along, though, that I can give a pretty accurate overview.

The way the type system in neptune came about is kind of unusual. Neptune descends from smalltalk which is not statically typed, but an important goal in the design of the language was to make it look familiar to C and C++ programmers. If you look at these two versions of the same program:

int gcd(int a, int b) {
  while (a != 0) {
    int c = a;
    a = b % a;
    b = c;
  }
  return b;
}

and

gcd(a, b) {
  while (a != 0) {
    var c = a;
    a = b % a;
    b = c;
  }
  return b;
}

the first program looks just like C (because that’s what it is) and the second one looks sort of C-ish but also pretty different. The only difference between the two programs is, of course, that one has type annotations and the other one doesn’t. When we set out to design neptune there was really no question: if we wanted it to look like C, we had to have a static type system. That is not to say that this was the only reason to have a static type system; there are plenty of other good reasons, but it is why not having one was never an option. Besides the syntax, the most important reason was that type annotations are documentation that the compiler can understand and check, which in turn enables much better tool support. The most important non-reasons were optimization and security: they were basically irrelevant in the design.

So, what’s the neptune type system like? Well, first of all it’s an optional type system. This means that you are never required to specify a type for anything. For instance, the two gcd programs above are both legal neptune code1. Basically, we want the type system to be there as a service that you can use if you want to but which doesn’t impose itself on you. From a “genealogical” viewpoint, it is closely related to the Strongtalk type system for smalltalk.

The examples above demonstrate the type syntax. If you use type annotations the syntax is the same as in the C language family. If you don’t use type annotations you either just leave the type out, as with return types and parameters, or use the keyword var, as with local variables. In some cases, for instance method parameters, you can do both.

The type system is based on the concept of protocols, which is smalltalk for what most people know as “interfaces”. A protocol is simply a collection of methods:

protocol SimpleOutputStream {
  void write(int b);
  void flush();
  void close();
}

Unlike most statically typed object-oriented languages, including Java and C#, neptune does not have class types. Class types are a Bad Thing because they not only specify the interface of an object but also dictate how the object must be implemented, breaking encapsulation.
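To put the point in Java terms (the names below are mine, purely illustrative): typing a variable with an interface specifies what an object can do without fixing how it is built, so unrelated implementations stay interchangeable:

```java
// Illustrative only: neptune protocols correspond roughly to Java interfaces.
interface Stream {
    void write(int b);
}

// Two unrelated implementations of the same interface.
class MemoryStream implements Stream {
    final StringBuilder buf = new StringBuilder();
    public void write(int b) { buf.append((char) b); }
}

class CountingStream implements Stream {
    int count = 0;
    public void write(int b) { count++; }
}

public class ProtocolDemo {
    // Written against the interface type: works with either class.
    static void writeAll(Stream s, String text) {
        for (char c : text.toCharArray()) s.write(c);
    }

    public static void main(String[] args) {
        MemoryStream m = new MemoryStream();
        writeAll(m, "hi");
        System.out.println(m.buf);  // hi
    }
}
```

A class type, by contrast, would pin `writeAll` to one particular implementation.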

There are two ways to define a protocol. One is to write it directly, as with the SimpleOutputStream example above. The other is implicit: whenever you define a class you automatically get a protocol with the same name. That might sound confusing but in practice it isn’t. Protocols work much the same way as interfaces, and protocols defined explicitly work exactly the same way as protocols defined by classes. For instance, you are free to implement the protocol of a class without subclassing it:

class MyString is String {
  ...
}

In this declaration, “is String” means that MyString implements the protocol of String, not that it extends String — the superclass of MyString is in this case Object. To create a subclass of String you would write

class MyString: String {
  ...
}

A class can extend one other class but implement an arbitrary number of protocols.
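Java enforces the same rule, one superclass and any number of interfaces, which makes for a quick sanity check (a minimal sketch, all names mine):

```java
// One superclass, any number of interfaces.
class Base {}
interface P1 {}
interface P2 {}

public class Derived extends Base implements P1, P2 {
    public static void main(String[] args) {
        Derived d = new Derived();
        System.out.println(d instanceof P1 && d instanceof P2);  // true
    }
}
```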

If you use String as the type of a variable it’s usually safe to think of it as if the variable can contain instances of String. But that is not accurate in general: the variable can actually contain any object that implements String’s protocol. Neptune has no way to express that a variable can only contain instances of a particular class.

Programs can dynamically test whether or not an object supports a particular protocol by using the is operator:

if (obj is String) ...

One of the things we haven’t implemented yet but seem to agree is a good idea is to combine this with a limited form of dependent types. In most other languages, for instance Java, the compiler doesn’t “remember” instanceof checks:

if (obj instanceof String) {
  label.setText(obj); // illegal: obj not String
}

The compiler doesn’t recognize that you only enter the body of the if if obj is a String — instead, you have to insert a cast:

if (obj instanceof String) {
  String str = (String) obj;
  label.setText(str);
}
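For completeness, here is the cast-after-check pattern as a self-contained Java program; the describe method is my stand-in for the label example:

```java
public class CastDemo {
    // Stand-in for the label example: the cast is required because the
    // compiler does not narrow obj's type inside the if.
    static String describe(Object obj) {
        if (obj instanceof String) {
            String str = (String) obj;
            return "string of length " + str.length();
        }
        return "not a string";
    }

    public static void main(String[] args) {
        System.out.println(describe("hello"));  // string of length 5
        System.out.println(describe(42));       // not a string
    }
}
```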

In neptune, we plan to allow the compiler to automatically remember extra type information about local variables after is checks, something that will in many cases make casts unnecessary:

if (obj is String) {
  label.setText(obj); // okay, obj is a String
}

In a language like Java that might make programs harder to understand, since the runtime behavior of the program can be affected if the compiler gains more type information. In neptune all it does is avoid unnecessary type warnings, since the runtime behavior of a program is not affected by static types.

This leads me to another aspect: type warnings. In neptune, all problems detected by the type checker are reported as warnings, not errors. It doesn’t matter how wrong you get the types, the program will still compile and run. Well, except for one case: you can’t mix function objects (blocks) and other objects, you have to get the types right there. That is because the compiler needs to be able to track that no references exist to a block after the method that created it returns — otherwise the runtime will die horribly. The philosophy of when to give warnings and errors is simple: we only give errors in a program if we can’t give meaning to the program, otherwise we give warnings. You won’t, for instance, see errors on unreachable code or other such nonsense. You can configure the IDE to give errors instead of warnings but that’s not the default.

Neptune’s static type system is “weak”. Even if x has type int there is no guarantee that it can only contain integers:

var my_string = "blah";
int x = my_string;
int y = x + 18;

In the second line, an object of statically unknown type is assigned to an integer-typed variable. That is perfectly legal and produces no warning, so this code compiles cleanly but will probably fail at runtime on the third line, when it tries to add an integer to a string. That doesn’t mean you can use this to break the runtime. The runtime is strongly typed, so no matter how wrong you get the static types it will survive. You might ask why we don’t add runtime checks to catch these situations, but that is potentially expensive and probably not worth the trouble. We could also warn whenever we’re not sure an assignment is safe, but that would generate warnings whenever untyped code called typed code, which goes against the principle that the type system should never impose itself on you if you decide not to use it. Personally, this has never been a problem for me, not even once.
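Java makes the same trade at a different point: you can defeat the static checks with a cast, but the strongly typed runtime catches the lie safely. A small sketch (my analogy, not neptune code):

```java
public class WeakVsStrong {
    // Returns true when a bad cast is caught by the runtime.
    static boolean castFailsSafely(Object untyped) {
        try {
            Integer x = (Integer) untyped;  // the static types permit this cast...
            return false;                   // ...and it may even succeed
        } catch (ClassCastException e) {
            return true;                    // ...but a lie is caught, not executed
        }
    }

    public static void main(String[] args) {
        System.out.println(castFailsSafely("blah"));  // true
    }
}
```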

The other thing we’ve discussed and seem to agree upon but haven’t implemented yet is nullable types. By default all types contain the null value:

String str = null; // legal

But often a value is not actually supposed to be null. Nullable types allow you to express this:

String! str = null; // illegal
String? str = null; // legal
String str = null; // legal

The type String! does not allow null whereas String? does. Plain String also allows null but differs from String? in that String is completely unchecked, whereas using a String? where a String! is expected without a null check causes a warning.
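Neptune’s String! has no direct Java counterpart, but the intent can be approximated with an eager null check at the boundary; in this sketch of mine, requireNonNull plays the role of the ! annotation:

```java
import java.util.Objects;

public class NonNullDemo {
    // Plays the role of a String! parameter: null is rejected eagerly
    // instead of blowing up somewhere deep inside the method.
    static int length(String s) {
        Objects.requireNonNull(s, "s must not be null");
        return s.length();
    }

    public static void main(String[] args) {
        System.out.println(length("abc"));  // 3
    }
}
```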

Ok, I think that’s enough about types for now. There are still interesting details about the type system that I haven’t written about — protocol objects, the void type, etc. — but that will have to wait.


1In case you were wondering, int is not a basic type in neptune — neptune has no basic types. It is just a convenient shorthand for the type Integer.

Hadrian 0.2

As recently mentioned, I’ve been working on improving the performance of my parser library. Now that I’m done picking the low-hanging fruit, performance-wise, I have decided to release the next version of hadrian, 0.2. This version contains a bunch of bugfixes and a few new features, but mainly it’s just a lot faster than 0.1.

Last time I wrote about it, throughput was around 1 MB of input per second, discounting I/O, which was already a considerable improvement over 0.1. Now it is around 1.5 MB per second, or around 1.2 MB per second on my machine if you include I/O overhead. That’s actually faster than ANTLR, at least according to the benchmarks I used for my thesis.

Of course it doesn’t matter how fast it is. Even though it really is the world’s most ass-kickingest parser, people are unlikely to use it until I write some more documentation. Bugger.

Cheated!

Lately I’ve been working on optimizing the parser engine in tedir. It’s been going pretty well — over the last few days I’ve been able to get a 40% performance improvement (discounting I/O overhead) which brings the parse time down to around 1 second per megabyte of input for a pretty messy grammar. It’s still not as fast as the implementation I wrote for my thesis (I got it up to around 4 MB/s) but still pretty good considering that this is a generalized parser, not LL, LALR(1) or whatever.

Most of the work during parsing happens in just a handful of methods in a single class, Runtime. Optimizing tight loops like this is a lot of fun because you can learn a lot about what goes on in the VM by seeing which optimizations work as expected and which ones don’t. Sometimes I’ve made refactorings that I thought should obviously improve performance but didn’t. For instance, inlining methods by hand often doesn’t help because the VM will do that for you. On the other hand, sometimes optimizations do work as expected and that’s what makes it all worthwhile. You probably have to have tried it to know what I’m talking about.

Today, I came across an interesting example of an optimization that I was sure would give a great improvement in performance but which turned out to do exactly the opposite.

Tedir uses BitSets a lot. In particular, the methods that take most of the time during parsing do. If you’ve ever looked at the implementation of BitSet you will have noticed that it does not appear to have been, shall we say, particularly heavily optimized. At least, I got the distinct feeling that this annoying class was wasting my precious processor cycles big time. Naturally, I thought: I’ll write my own BitSet, FastBitSet, that’ll kick BitSet’s ass! So I did: I wrote a very simple bit set without all the bounds checks, basically BitSet with half the code or so discarded. I replaced all uses of BitSet with FastBitSet and ran the benchmarks. The result was shocking: not only was the new implementation not faster, it was slower by a factor of two! I tried out various changes to the implementation but nothing helped. I got suspicious.

My colleague Kasper suggested that I should go back to java’s BitSet again and see how much time the profiler reported was spent executing methods in BitSet. Surprise: none! According to the profiler, no time at all was spent in methods in BitSet. I can only draw one conclusion from this: the VM is cheating — it must have special handling of the BitSet class. Incidentally, I work with a bunch of people that have previously worked on sun’s java VM but none of them knew anything about this optimization, at least they weren’t the ones who had implemented it. But I think the empirical evidence is pretty strong that the optimization takes place.

It’s hard to complain about an optimization that causes your code to become twice as fast without losing generality but still, I do feel kind of cheated.
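I can’t show tedir’s actual inner loop here, but the flavor of BitSet work in a parser (unioning token sets, membership tests) is roughly this:

```java
import java.util.BitSet;

public class BitSetDemo {
    // Union of two token sets: the kind of tight-loop operation
    // a generalized parser performs constantly.
    static BitSet union(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone();
        result.or(b);
        return result;
    }

    public static void main(String[] args) {
        BitSet followA = new BitSet();
        followA.set(3);
        followA.set(7);

        BitSet followB = new BitSet();
        followB.set(7);
        followB.set(11);

        BitSet u = union(followA, followB);
        System.out.println(u.cardinality());  // 3
        System.out.println(u.get(11));        // true
    }
}
```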

Post-script: Until now my goal has been to post here roughly once a week. However, as long as the weather in Denmark is as good as it has been the last few weeks I won’t post as often. Sitting indoors in front of my computer just doesn’t have the same appeal in June as it does in January.

CbfgFpevcg

One of my favorite languages is PostScript. It’s not so much the language itself — it’s simple and general but can be a bit of a hassle to use — but the idea of using a general programming language as a document format. I’ve been playing around a bit lately and have written two small PostScript programs that can transform PS documents: munch-rot13.ps and munch-nl.ps. One script ROT13s the document (example) and the other simply swaps the letter N and the letter L (example). The N/L thing is sort of a tradition at daimi.

How do you “run” the programs? You simply concatenate the program with the document you want to transform. To apply ROT13 to document.ps you simply write

$ cat munch-rot13.ps document.ps > document-rot13.ps

You can apply the transformations as many times as you want, so given a document encoded with munch-rot13.ps you can decode it by just applying munch-rot13.ps again. And, incidentally, if you want to decode a document but don’t have munch-rot13.ps, don’t worry: since the document was encoded by appending it to the script, the encoded document actually carries the script around with it.

You can also mix the transformations: apply ROT13, then N/L and then ROT13 again. That causes A and Y to be switched in the document.
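The A/Y claim is easy to check: ROT13 maps A to N and L to Y, so conjugating the N/L swap by ROT13 swaps A and Y. A quick sketch (in Java rather than PostScript) that verifies it:

```java
public class MunchCheck {
    // Standard ROT13 on ASCII letters; everything else passes through.
    static char rot13(char c) {
        if (c >= 'A' && c <= 'Z') return (char) ('A' + (c - 'A' + 13) % 26);
        if (c >= 'a' && c <= 'z') return (char) ('a' + (c - 'a' + 13) % 26);
        return c;
    }

    // The N/L swap performed by munch-nl.ps.
    static char swapNL(char c) {
        switch (c) {
            case 'N': return 'L';
            case 'L': return 'N';
            case 'n': return 'l';
            case 'l': return 'n';
            default:  return c;
        }
    }

    public static void main(String[] args) {
        // rot13 . swapNL . rot13 swaps A and Y and leaves other letters alone.
        System.out.println(rot13(swapNL(rot13('A'))));  // Y
        System.out.println(rot13(swapNL(rot13('Y'))));  // A
        System.out.println(rot13(swapNL(rot13('B'))));  // B
    }
}
```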

Pointless? Maybe, but I thought it was an interesting challenge. And, as Mayer Goldberg demonstrates, you can actually write useful programs in PostScript. Maybe I’ll try that some day. I’m not completely done with pointlessness yet, though: I’m considering writing a version that Enigmas the document.

Hadrian Demo

My pet project over the last few years is a parser library for Java called hadrian. I was lucky enough to be able to write my master’s thesis about this project but that was neither the beginning nor the end of it. Since my thesis I’ve rewritten it completely and now we’re using the new library in the neptune compiler.

One thing I’ve neglected, though, is documentation. I’ll try to write a few posts about it here at some point, to get started, but until then I’ve thrown together a demo applet which demonstrates some aspects of the library. The demo allows you to write a grammar and some input and then parses the input according to the grammar and shows the result as a syntax tree and XML. It also contains a few examples to play around with.

In itself, this demonstrates two things. Firstly, that hadrian allows you to parse input according to a grammar that wasn’t generated at compile time but constructed by the running program. Secondly, that the result of parsing the input is constructed by the framework — you don’t have to specify your own ASTs. That’s only a small part of what you can do but it’s pretty cool in itself. If I do say so myself. Which I do.

The code that does the actual work of the applet is pretty simple. Here it is (I’ll explain what it does below):

private void processInput(final String grammarString,
                          final String inputString) {
  final ITextDocument grammarDoc = new StringDocument(grammarString);  // 1
  final MessageMonitor monitor = new MessageMonitor();                 // 2
  final HadrianParser parser =
      HadrianReader.getDefault().readParser(grammarDoc, monitor);      // 3
  if (processMessages(monitor)) return;                                // 4
  final ITextDocument inputDoc = new StringDocument(inputString);
  final INode node = parser.parse(inputDoc, monitor);                  // 5
  if (processMessages(monitor)) return;
  final DefaultTreeModel treeModel =
      new DefaultTreeModel(SyntaxTreeNode.create(node, null));
  syntaxTree.setModel(treeModel);                                      // 6
  syntaxXML.setText(TextUtils.toXML(node));                            // 7
}

Here’s what’s going on in the code.

  1. The string containing the grammar is wrapped in a text document. A text document is pretty similar to a string except that it keeps track of where line breaks are and stuff like that. That can be very handy for instance when reporting errors in a document.
  2. We create a message monitor. A message monitor collects any messages that the system might generate while processing the grammar.
  3. We read the grammar and construct a parser. If this goes well, readParser returns a parser we can use to parse the input. If something goes wrong it reports an error to the message monitor which we…
  4. …check for and bail out if necessary.
  5. We wrap the string containing the input and parse it using the parser we just successfully constructed from the grammar.
  6. After checking that parsing went well we take the result, a syntax tree in the form of an INode, and wrap it in something that can be displayed in the UI.
  7. Finally, we also convert the syntax tree to XML and plant the result in the appropriate text box.

Grammars don’t have to be read from source files or strings, there’s also an API that allows programs to construct them directly. But it turns out to be pretty handy to be able to read them from source files.

If you want to take a look at how the applet is implemented, the source is available with the code in tedir-applet.jar. Hadrian itself lives at sourceforge, as a part of the tedir project.