So, It’s been a month and a half since I last posted anything. It’s not that I’ve given up writing here it’s just that half of what I’m doing these days I’m not allowed to tell anyone and the other half I’m too busy doing to write about. However, I fell across a useful compiler implementation technique that I thought I’d take the effort to write about. It has to do with abstract syntax trees pretending to be bytecode (and vice versa).
I am, as usual, playing around with a toy language — the second one since neptune died. This time I thought I’d shake things up a little and implement both the compiler and runtime it in C++ rather than Java. My first attempt at a compiler was straightforward: the parser read the source and turned it into a syntax tree, the syntax tree was used to generate bytecode, the bytecode was executed by the runtime.
It turns out, however, that the obvious approach is pretty far from optimal. The compiler is really stupid and doesn’t actually use the syntax tree except to do a simple recursive traversal when generating the bytecode. On the other hand, the syntax tree would be really handy if I wanted to JIT compile the code at runtime, but at that point it is long gone and there is only the bytecode left.
I tried out different approaches to solving this problem and ended up with something that is a combination of bytecode and syntax trees. In particular, it has the advantages of both bytecode and syntax trees at the cost of a small overhead. Here’s an example. The code:
def main() {
if (1 < 2) return a + b;
else Console.println("Huh?");
}
is turned by the parser into the following code:
Syntax Tree Code (entry point: 49)
0 literal 1
2 literal 2
4 invoke @0.<(@2)
9 slap 1
11 if-false ~17
13 global "a"
15 global "b"
17 invoke @13.+(@15)
22 slap 1
24 return @17
26 goto ~13
28 global "Console"
30 literal "Huh?"
32 invoke @28.println(@30)
37 slap 1
39 if-else @4 ? @24 : @32
43 pop 1
45 literal #
47 return @45
49 block [@39 @47]
It it essentially a bytecode format with some extra information that allows you to reconstruct the syntax tree. A bytecode interpreter would execute it by starting at instruction 0 and ignoring the extra structural information:
0 literal 1
2 literal 2
4 invoke@0.<(@2)
9 slap 1
11 if-false ~17
13 global "a"
15 global "b"
17 invoke@13.+(@15)
22 slap 1
24 return@17
26 goto ~13
28 global "Console"
30 literal "Huh?"
32 invoke@28.println(@30)
37 slap 1
39if-else @4 ? @24 : @32
43 pop 1
45 literal #
47 return@45
49block [@39 @47]
On the other hand, if you want to reconstruct the syntax tree you can read the code backwards, starting from the last instruction. In this case the last instruction says that the preceding code was generated from a statement block with two statements, the ones ending at 39
and 47
. If you then jump to position 39
you can see that the code ending there was generated from an if statement with the condition ending at 4
, the then-part ending at 24
and the else-part ending at 32
. Just like an interpreter would ignore the extra structural information, a traversal will ignore the instructions that carry no structural information such as goto
and pop
:
/--> 0 literal 1
+--> 2 literal 2
\--- 4 invoke @0.<(@2) <---\
9slap 1|
11if-false ~17|
13 global "a" <--\ |
15 global "b" <--+ |
/--> 17 invoke @13.+(@15) ---/ |
| 22slap 1|
\--- 24 return @17 <---+
26goto ~13|
/--> 28 global "Console" |
+--> 30 literal "Huh?" |
\--- 32 invoke @28.println(@30) <---+
37slap 1|
/---> 39 if-else @4 ? @24 : @32 ----/
| 43pop 1
| 45 literal #
+---> 47 return @45
\---- 49 block [@39 @47]
Here I’ve tried to illustrate the way the instructions point backwards using puny ASCII art. Anyway, you get the picture.
When you want to traverse the syntax tree embedded is a piece of code you don’t actually have to materialize it. Using magical and mysterious features of C++ you can create a visitor that gives the appearance of visiting a proper syntax tree when it’s really just traversing the internal pointers in one of these chunks of syntax-tree-/byte-code.
Pretty neat, I think. Having a closed syntax tree format (by which I mean one that doesn’t use direct memory pointers) means that you can let it flow from the compiler into the runtime, so you can dynamically inspect the syntax tree for a piece of code. It’s also easy to save in a binary file or send over the network.
More importantly it allows syntax trees to flow from running programs back into the compiler at runtime. One of the things I’m experimenting with this time is metaprogramming and representing code as data. In particular, I’m going for something in the style of MetaML but with dynamic rather than static typing. For that, this format will be very handy.
Pingback: Status, end of January » Hummus and Magnets