I'm working on our new neptune plugin for Eclipse and I draw a lot of inspiration from the JDT Java plugin that comes with Eclipse. One thing I really like is that you can do refactorings and navigation in a file even if parts of the source are not legal Java code. The parser they use apparently just skips the illegal parts and gets right back on track as soon as the source becomes parseable again. For instance, the parser has no problem reading this nonsense:

public class Klass {

int float double foo
String bar() {
List baz() { { {
float %$%$ quux();
try catch finally if else
class K {

}

This class is shown in the outline as having one field, foo, three methods, bar, baz and quux, and an inner class K. I would have expected the unmatched braces or illegal characters to throw the parser off a bit, but it doesn't seem to care.

My first guess was that maybe they used the indentation as a guide but apparently whitespace is completely irrelevant -- putting the whole thing on one line makes no difference. So to understand how they did it I ended up single-stepping through the gory innards of JDT until I found the parser and hoped to see all sorts of well-documented heuristic rules that I could understand and use in our own plugin.

Full of anticipation, I stepped over lines and lines of initialization code, saw the first token being read from the input and finally reached the sacred inner loop of the parser. After having gone around in the loop a few times I noticed that what appeared to control everything was the result of a single method call. Ready for the big revelation, I stepped into the method and found a single line:

return term_action[term_check[base_action[state]+sym]
== sym ? base_action[state] + sym : base_action[state]];

Disappointed! I hate automaton-based parsers.
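
For what it's worth, that one line reads like a lookup in a compressed parse table: each state gets an offset into one shared array, a parallel "check" array says whether a slot really belongs to the symbol being asked about, and anything else falls back to the state's default action. Here is a minimal sketch of that idea, assuming that reading is right. The array names are borrowed from the line above, but the table contents, the class name TableLookupSketch and the action numbers are all made up for illustration and have nothing to do with JDT's real tables.

```java
public class TableLookupSketch {
    // Hypothetical tables for a toy automaton with two states and
    // terminal symbols numbered 1..3. All values are invented.

    // Offset of each state's row inside the shared arrays.
    static final int[] base_action = {0, 2};

    // term_check[i] holds the symbol that slot i was filled in for,
    // or 0 if the slot is a default slot or unused.
    static final int[] term_check = {0, 1, 0, 3, 2, 0};

    // term_action[base_action[state]] is the default action of a state;
    // the other slots hold the explicit action for (state, symbol).
    static final int[] term_action = {99, 10, 77, 30, 20, 0};

    // Same expression as the one-liner quoted above, just spread out.
    static int tAction(int state, int sym) {
        int slot = base_action[state] + sym;
        // Use the slot only if it was really filled in for this symbol;
        // otherwise fall back to the state's default action.
        int index = (term_check[slot] == sym) ? slot : base_action[state];
        return term_action[index];
    }

    public static void main(String[] args) {
        for (int state = 0; state < base_action.length; state++) {
            for (int sym = 1; sym <= 3; sym++) {
                System.out.println("state " + state + ", sym " + sym
                        + " -> action " + tAction(state, sym));
            }
        }
    }
}
```

In this toy table the two rows overlap: slot 3 belongs to state 0 and symbol 3, so asking state 1 about symbol 1 lands on the same slot, fails the check, and gets state 1's default instead. That interleaving is what keeps the tables small, and it is also why stepping through them in a debugger tells you so little about the grammar or about how the parser recovers from errors.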