Once every few years I end up, for one reason or another, having to implement an operator precedence parser. Getting it right is a bit fiddly. I once found a great article about how you do it and tend to just follow that but a while ago that article disappeared.

Now, again, I find myself having to implement an operator precedence parser, in yet another language, and so I dug up the old article on archive.org and decided: since I find it super useful and it’s completely gone from the web, maybe I should just host a copy here. So here it is. It’s a design doc from the Sugar parser library, written by Douglas Gregor. So, to be clear, I didn’t write this doc; I’m just reposting it here (with a few minimal formatting changes) because it’s good and that way it’s at least available somewhere.

If anyone, particularly Douglas Gregor, has an opinion or an objection please leave a comment.

Shift/Reduce Expression Parsing

Introduction

The Sugar expression parser is similar to the class of operator-precedence parsers, but has been extended to support common requirements when parsing expressions, such as function application, confix (grouping) operators, and operator name disambiguation. Additionally, Sugar is intended to be usable without any precompiling phase, making it ideal for rapid or on-the-fly construction of expression parsers.

Expressions

For the purposes of this document, an expression is a sequence of operators and operands, where operators fall into one of the following categories:

Type	Arity	Placement	Examples
Prefix	Unary	Prior to operand	Unary minus
Postfix	Unary	After operand	Factorial
Infix	Binary	Between operands	Addition, multiplication, and division
Confix	Unary	Surrounding operand	Parentheses, half-open ranges
Function application	Binary	After first operand and surrounding second operand	Mathematical functions (sin(x)), array indexes (a[5])

The confix and function application operators are essentially split into their component parts, an open symbol and a close symbol, during the parsing phase. The “open” symbol will occur on the left-hand side and the “close” symbol will occur on the right-hand side.

Constructing the parser

The expression parser is a shift/reduce parser with zero lookahead that utilizes two separate stacks: one for operators and one for operands. Any operands in the input stream are immediately shifted onto the operand stack; operators are immediately shifted onto the operator stack only if the operator stack is empty. Otherwise, the following table determines the action of the parser depending on the type of the operator on top of the operator stack and on the type of the current operator token.

	Current operator →
Top of stack ↓		Prefix	Postfix	Infix	Confix Open	Confix/Function Close	Function Open	End of Input
	Prefix	shift	precedence	precedence	shift	reduce	precedence	reduce
	Postfix	-	reduce	reduce	-	reduce	reduce	reduce
	Infix	shift	precedence	precedence/associativity	shift	reduce	precedence	reduce
	Confix Open	shift	shift	shift	shift	shift	shift	reduce
	Confix/Function Close	reduce	reduce	reduce	reduce	reduce	reduce	reduce
	Function Open	shift	shift	shift	shift	shift	shift	reduce
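
The table above can be transcribed directly into a lookup keyed on (top of stack, current operator). The following is only a sketch in Python: the category names and the `action` helper are illustrative assumptions, not part of Sugar, and the "accept" result for an empty stack at end of input is taken from the worked example later in this document rather than from the table.

```python
# Operator categories used as row/column keys (names are illustrative).
PREFIX, POSTFIX, INFIX = "prefix", "postfix", "infix"
CONFIX_OPEN, CLOSE, FUNC_OPEN, END = "confix_open", "close", "func_open", "end"

# Column order matches the table header, left to right.
COLUMNS = [PREFIX, POSTFIX, INFIX, CONFIX_OPEN, CLOSE, FUNC_OPEN, END]

# Each row transcribes one line of the table; "-" marks a pairing the table leaves blank.
ROWS = {
    PREFIX: ["shift", "precedence", "precedence", "shift",
             "reduce", "precedence", "reduce"],
    POSTFIX: ["-", "reduce", "reduce", "-", "reduce", "reduce", "reduce"],
    INFIX: ["shift", "precedence", "precedence/associativity", "shift",
            "reduce", "precedence", "reduce"],
    CONFIX_OPEN: ["shift"] * 6 + ["reduce"],
    CLOSE: ["reduce"] * 7,
    FUNC_OPEN: ["shift"] * 6 + ["reduce"],
}

ACTIONS = {
    (top, current): act
    for top, row in ROWS.items()
    for current, act in zip(COLUMNS, row)
}


def action(top, current):
    """Look up the parser action; `top` is None when the operator stack is empty."""
    if top is None:
        # Assumption: an empty stack at end of input means the parse is finished.
        return "accept" if current == END else "shift"
    return ACTIONS[(top, current)]
```

For example, `action(INFIX, INFIX)` returns `"precedence/associativity"`, and `action(None, PREFIX)` returns `"shift"`.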

Description of parsing actions

A shift operation pushes the current operator token onto the operator stack.
A reduce operation pops the operator token off the top of the operator stack, and then pops the appropriate number of operands from the operand stack. Then the operator is applied to the operand(s) and the result is pushed back on the operand stack. Reduction of confix operators and of function application requires popping two operators off the operator stack.
A precedence operation determines the relative precedence of the operator on top of the operator stack (top) and the current operator (current).
- If top has a lower precedence than current, shift.
- If top has a higher precedence than current, reduce.
A precedence/associativity operation first compares the precedence according to the precedence operation. If the precedence is equivalent, associativity is considered:
- If top associates left of current, reduce.
- If top associates right of current, shift.
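
As a rough illustration of these actions, here is a minimal Python sketch restricted to infix operators. The `Op` class, the helper names, and the convention that a larger number means higher precedence are all assumptions made for the example. It also assumes that after a reduce the current operator is compared again against the new top of the stack, which is the usual behaviour for this kind of parser but is not spelled out above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    symbol: str
    precedence: int      # assumed convention: larger binds tighter
    assoc: str = "left"  # "left" or "right"


def shift(operators, current):
    """Push the current operator token onto the operator stack."""
    operators.append(current)


def reduce_binary(operators, operands):
    """Pop one infix operator and two operands; push the combined result."""
    op = operators.pop()
    right = operands.pop()
    left = operands.pop()
    operands.append(f"({left} {op.symbol} {right})")


def precedence_action(top, current):
    """Return "shift" or "reduce", or None when the precedences are equal."""
    if top.precedence < current.precedence:
        return "shift"
    if top.precedence > current.precedence:
        return "reduce"
    return None


def precedence_assoc_action(top, current):
    """Fall back to associativity when precedence alone does not decide."""
    decided = precedence_action(top, current)
    if decided is not None:
        return decided
    return "reduce" if top.assoc == "left" else "shift"


def parse_infix(tokens, ops):
    """Parse a flat token list of operands and infix operators into a string."""
    operands, operators = [], []
    for tok in tokens:
        if tok not in ops:
            operands.append(tok)  # operands go straight to the operand stack
            continue
        current = ops[tok]
        while operators and precedence_assoc_action(operators[-1], current) == "reduce":
            reduce_binary(operators, operands)
        shift(operators, current)
    while operators:  # end of input: reduce until the operator stack is empty
        reduce_binary(operators, operands)
    return operands[0]


OPS = {"+": Op("+", 1), "*": Op("*", 2), "^": Op("^", 3, "right")}
```

With these definitions, `parse_infix("a + b * c".split(), OPS)` yields `"(a + (b * c))"` and `parse_infix("a ^ b ^ c".split(), OPS)` yields `"(a ^ (b ^ c))"`. The sketch builds a parenthesized string instead of evaluating anything, so the grouping chosen by each reduce stays visible.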

Rejecting Invalid Expressions

Operator-precedence parsers are often not used because they accept invalid strings. The shift-reduce parser as specified above will consider the expressions x + x, + x x, and x x + equivalent, even though only the first form is correct. This weakness is easily remedied with the use of the following state machine to track what type of operator or operand is expected at any given point in time.

[Figure: state machine diagram]

The state machine contains three states: the pre-operand state where we collect confix open and prefix operators while waiting for an operand, the post-operand state where we have received an operand and are applying postfix operators to it and closing confix operators or finishing function calls, and finally an error state that will be entered when an invalid parse is detected.
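
The three states can be modelled as a small transition table. Since the original diagram is not reproduced here, the transitions in this Python sketch are an assumption reconstructed from the description above and from the worked example below, and the token-kind names are illustrative.

```python
PRE, POST, ERROR = "pre-operand", "post-operand", "error"

# Assumed transitions; any token kind not listed for a state leads to ERROR.
TRANSITIONS = {
    PRE: {
        "operand": POST,     # the awaited operand has arrived
        "prefix": PRE,       # still waiting for an operand
        "confix_open": PRE,  # still waiting for an operand
    },
    POST: {
        "postfix": POST,     # applied to the operand just seen
        "close": POST,       # closes a confix or function application
        "infix": PRE,        # a right-hand operand is now required
        "func_open": PRE,    # an argument operand is now required
    },
}


def step(state, kind):
    """Advance the state machine by one token kind."""
    return TRANSITIONS.get(state, {}).get(kind, ERROR)


def is_valid(kinds):
    """True if the sequence of token kinds never errors and ends after an operand."""
    state = PRE
    for kind in kinds:
        state = step(state, kind)
        if state == ERROR:
            return False
    return state == POST
```

Under these assumptions, `x + x` (`["operand", "infix", "operand"]`) is accepted, while `+ x x` and `x x +` are both rejected. Note that this check only tracks what kind of token may come next; matching each close symbol to its open symbol is left to the operator stack.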

Disambiguation of Operator Names

Within many domains, certain operators are reused in different contexts. Several obvious examples are the unary and binary minus operators that use the same symbol ‘-’, the absolute-value confix operator that uses the symbol ‘|’ as both its open and close symbol, and the ‘+’ operator for regular expressions that is both a postfix positive closure operator and an infix operator for specifying alternatives.

Disambiguation of operator names is in many cases directly related to the state machine used to identify invalid sequences. Given any operator name, we determine the set of operator types that it may belong to. We then intersect this with the set of operator types that are valid at our current state within the state machine to determine the role(s) this operator may play in this context. Several cases are left ambiguous by this intersection. These cases are listed below, each either with a specific resolution or marked as impossible to resolve with this class of parser.

Disambiguation at this phase requires lookahead of one additional token, and is also based on the state machine. Disambiguation is possible when the possible meanings of the operator differ in the states that will result from their interpretation. For instance, if a given operator is both postfix and infix, the postfix interpretation would remain in the post-operand state whereas the infix interpretation would transfer to the pre-operand state. Looking ahead one symbol, we can determine if the next symbol would be valid in either state: if it is valid in only one of the resulting states, we can disambiguate the prior (non-lookahead) symbol to ensure that the appropriate state is reached so that the lookahead symbol will not result in a parse error.

{prefix, confix open}: ambiguous (requires arbitrary lookahead).
{postfix, infix}: single lookahead disambiguation based on state.
{confix/function close, infix}: single lookahead disambiguation based on state.
{function open, infix}: ambiguous (requires arbitrary lookahead).
{postfix, confix/function close}: ambiguous (requires arbitrary lookahead).
{postfix, function open}: single lookahead disambiguation based on state.
{function open, function/confix close}: single lookahead disambiguation based on state.
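
The two-step procedure (intersect with the current state, then look ahead one token) might be sketched in Python as follows. The role names, the `VALID_IN` and `NEXT_STATE` tables, and the `disambiguate` function are assumptions for illustration; in particular the state transitions are inferred from the text, not taken from Sugar itself.

```python
PRE, POST = "pre", "post"

# Roles that are acceptable in each state ("end" stands for end of input).
VALID_IN = {
    PRE: {"operand", "prefix", "confix_open"},
    POST: {"postfix", "infix", "close", "func_open", "end"},
}

# State reached after accepting a token in the given role (assumed transitions).
NEXT_STATE = {
    "operand": POST, "postfix": POST, "close": POST,
    "prefix": PRE, "infix": PRE, "confix_open": PRE, "func_open": PRE,
}


def disambiguate(state, roles, lookahead_roles):
    """Return the roles of a symbol that survive state and one-token lookahead checks.

    `roles` is the set of operator types the symbol may belong to, and
    `lookahead_roles` is the same set for the following token.
    """
    candidates = roles & VALID_IN[state]
    if len(candidates) <= 1:
        return candidates  # the state alone resolves it (or rejects it)
    # Keep only interpretations whose resulting state accepts the next token.
    return {
        role for role in candidates
        if lookahead_roles & VALID_IN[NEXT_STATE[role]]
    }
```

For instance, `disambiguate(PRE, {"prefix", "infix"}, {"operand"})` returns `{"prefix"}` for a minus sign, and a regular-expression ‘+’ in the post-operand state resolves to `{"infix"}` when an operand follows and to `{"postfix"}` at end of input. A result with more than one element corresponds to the cases marked ambiguous in the list above: for example, {postfix, confix/function close} leaves both roles standing because either one stays in the post-operand state.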

Parsing Examples

Mathematical Expressions

Parse the expression x * |y+z| + -3^x^y using the standard mathematical rules for precedence and associativity.

State	Operand Stack	Operator Stack	Token	Token type	Action
Pre			x	operand	shift
Post	`x`		*	infix operator	shift
Pre	`x`	*	\|	confix open or confix close	disambiguate as confix open, shift
Pre	`x`	* (confix open \|)	y	operand	shift
Post	`x y`	* (confix open \|)	+	infix or prefix operator	disambiguate as infix, shift
Pre	`x y`	* (confix open \|) +	z	operand	shift
Post	`x y z`	* (confix open \|) (infix +)	\|	confix open or confix close	disambiguate as close, reduce
Post	`x (y+z)`	* (confix open \|)	\|	confix open or confix close	disambiguate as close, reduce
Post	`x (\|y+z\|)`	*	+	infix or prefix	disambiguate as infix, compare precedence, reduce
Post	`(x * (\|y+z\|))`		+	infix or prefix	disambiguate as infix, shift
Pre	`(x * (\|y+z\|))`	(infix +)	-	infix or prefix	disambiguate as prefix, shift
Pre	`(x * (\|y+z\|))`	(infix +) (prefix -)	3	operand	shift
Post	`(x * (\|y+z\|)) 3`	(infix +) (prefix -)	^	infix	compare precedence, shift
Pre	`(x * (\|y+z\|)) 3`	(infix +) (prefix -) ^	x	operand	shift
Post	`(x * (\|y+z\|)) 3 x`	(infix +) (prefix -) ^	^	infix	compare precedence, compare associativity, shift
Pre	`(x * (\|y+z\|)) 3 x`	(infix +) (prefix -) ^ ^	y	operand	shift
Post	`(x * (\|y+z\|)) 3 x y`	(infix +) (prefix -) ^ ^	end	end	reduce
Post	`(x * (\|y+z\|)) 3 (x^y)`	(infix +) (prefix -) ^	end	end	reduce
Post	`(x * (\|y+z\|)) (3^(x^y))`	(infix +) (prefix -)	end	end	reduce
Post	`(x * (\|y+z\|)) (-(3^(x^y)))`	(infix +)	end	end	reduce
Post	`((x * (\|y+z\|)) + (-(3^(x^y))))`	empty	end	end	accept

Douglas Gregor

Last modified: Sat Aug 18 12:46:13 EDT 2001