This is my second post in a series about Babbage’s mechanical calculating engines. The first post was about why it was so important in the early 1800s to be able to produce accurate arithmetic tables and ended with what I’ll be generous and call a cliffhanger: that Babbage’s difference engine was a groundbreaking solution to that problem. In this post I’ll explain how the difference engine works, how to “code” it and how to interpret what you get back.

Before we get to the engine itself there’s a bit of math to explain the basic principles it works by. To give you a taste of the engine before that, here is a small JavaScript emulator [1] that performs a simple calculation. It’s tiny compared to the real engine but works by the same principles, just on a smaller scale.

Try stepping it a few times. The model is actually quite straightforward and the computations you can use it for are based on some basic mathematical principles. You can treat it as an exercise before reading the rest if you like: try to work out what it’s calculating, how it does it, and how you can make it calculate other things. Of course you can also just read on and I’ll tell you.


The difference engine is a very simple device in principle. It’s an adding machine, that’s all it is [2]. Part of what makes it such an impressive accomplishment is that it took a difficult problem, calculating complex functions, and solved it using an approach simple enough that it could be implemented mechanically using contemporary technology [3].

So, how do you calculate complex functions using just addition? Obviously you can’t in general so let’s start off easy by looking just at polynomials.

Consider this simple polynomial, the square function:

f(x) = x^2

The first few values are

  \begin{tabular}{lcl}  f(0) & = & 0 \\  f(1) & = & 1 \\  f(2) & = & 4 \\  f(3) & = & 9 \\  & \vdots &  \end{tabular}

The difference engine is based on the method of finite differences. It’s similar to differentiation but simpler, relying on nothing more than basic arithmetic. It works as follows: you take the values of your polynomial at a fixed interval; here we’ll use the first four values from before:

  \begin{tabular}{l}  0 \\  \\  1 \\  \\  4 \\  \\  9  \end{tabular}

Then you find the distance between each successive pair of them:

  \begin{tabular}{l|l}  0 \\  & 1 \\  1 \\  & 3 \\  4 \\  & 5 \\  9  \end{tabular}

These are the first differences. Then you find the distance between the first differences the same way:

  \begin{tabular}{l|l|l}  0 \\  & 1 \\  1 && 2\\  & 3 \\  4 && 2\\  & 5 \\  9  \end{tabular}

These are the second differences. And one last time to get the third differences:

  \begin{tabular}{l|l|l|l}  0 \\  & 1 \\  1 && 2\\  & 3 && 0\\  4 && 2\\  & 5 \\  9  \end{tabular}

You can see the similarity to differentiation: it’s a polynomial of degree 2 so the first differences increase linearly just like the first derivative, the second differences are constant just like the second derivative, and so on. We don’t actually need the third differences, they’ll all be 0 anyway, so I’ll leave those out below.

What’s neat is that once you have these values you can extend the table using nothing but addition. You know the difference between the first differences is fixed, 2, so you can get the next first difference by adding 2 to the previous one. And you know the difference between the function values is the first difference, so you can get the next function value just by adding the next first difference to the previous one. Okay, maybe it’s easier to explain with a concrete example:

  \begin{tabular}{l|l|l}  0 \\  & 1 \\  1 && 2\\  & 3 \\  4 && 2 \\  & 5 \\  9  \end{tabular}  \quad\to\quad  \begin{tabular}{l|l|l}  0 \\  & 1 \\  1 && 2\\  & 3 \\  4 && 2 \\  & 5 & \tiny{+0} \\  9 && \bf{2}  \end{tabular}  \quad\to\quad  \begin{tabular}{l|l|l}  0 \\  & 1 \\  1 && 2\\  & 3 \\  4 && 2 \\  & 5 \\  9 & \tiny{+2} & \bf{2} \\  & \bf{7} \\  \end{tabular}  \quad\to\quad  \begin{tabular}{l|l|l}  0 \\  & 1 \\  1 && 2\\  & 3 \\  4 && 2 \\  & 5 \\  9 && \bf{2} \\  \tiny{+7} & \bf{7} \\  \bf{16} \\  \end{tabular}

Notice that we don’t need the full table at this point, we only need that for calculating the initial values. All we need to generate more values is the last of each of the differences:

  \begin{tabular}{l|l|l}  \multicolumn{1}{r}{}&&2 \\  & 7 & \tiny{+0}\\  16 &\tiny{+2}& 2 \\  \tiny{+9} & 9 \\  \bf{25} \\  \end{tabular}  \quad\to\quad  \begin{tabular}{l|l|l}  \multicolumn{1}{r}{}&&2 \\  & 9 & \tiny{+0}\\  25 &\tiny{+2}& 2 \\  \tiny{+11} & 11 \\  \bf{36} \\  \end{tabular}  \quad\to\quad  \begin{tabular}{l|l|l}  \multicolumn{1}{r}{}&&2 \\  & 11 & \tiny{+0}\\  36 &\tiny{+2}& 2 \\  \tiny{+13} & 13 \\  \bf{49} \\  \end{tabular}  \quad\to\quad\dots

This provably works for any polynomial. To generate a sequence of values for a polynomial of degree n all you need is n+1 initial values; from the values you calculate the table of differences using subtraction, and from there on you can calculate as many successive values as you like using just addition. You don’t even need to know the closed form of the polynomial as long as you can evaluate the initial values at fixed intervals.

This is the basic principle of the difference engine, and what it’s named after. The engine has 8 integer registers called columns, each holding a 31-digit integer value; together they represent the current value of the function and the first through seventh differences. By cranking the handle those values are added together from right to left. Here is another mini-emulator, this one calculating the square function using the differences we just calculated:

You can see the values are being added together from right to left, and the current function value in the leftmost column is printed for each emulated crank on the handle. Printing was also a part of the original difference engine. A common source of errors in mathematical tables was typesetting, so to avoid that step the engine would automatically stamp its output in a soft material that could be used directly for printing, as well as print a log on paper.

Being able to evaluate an integer polynomial is just the beginning though. First of all, integers aren’t enough; we need to be able to evaluate real-valued functions. Secondly, so far we’ve only seen positive values; we also need negatives. Finally, polynomials can be useful in themselves, but we’re really more interested in more complex functions like logarithms and trigonometric or astronomical functions. With a few tricks the difference engine can handle all of those things.


First off: how do we use this to evaluate real-valued functions? You use fixed-point decimal numbers. For instance, say we want to plot the square function from before but this time in steps of 0.25:

  \begin{tabular}{lcl|l|l}  f(0) & = & 0 \\  &&& 0.0625 \\  f(0.25) & = & 0.0625 && 0.125 \\  &&& 0.1875 \\  f(0.5) & = & 0.25  \end{tabular}

These are fractional numbers, but if you multiply them by 10^4 (10,000) we’re back to integers:

  \begin{tabular}{lcl|l|l}  10000 f(0) & = & 0 \\  &&& 625 \\  10000 f(0.25) & = & 625 && 1250 \\  &&& 1875 \\  10000 f(0.5) & = & 2500 \\  \end{tabular}

Now we’re back to something the engine can calculate:

I’ve added a line between the digits to mark where the decimal point goes. It also gets added to the output (I don’t believe the original engine did this). But the decimal point is purely a matter of interpretation by the operator, the adding mechanism is not aware of it, it’s operating purely on 6-digit integer values.

In this case we were lucky: there was a relatively small power of 10 we could multiply onto the values to get integers without losing precision. That’s unlikely to be the case in general. If you can’t use this trick you multiply the values by as large a power of 10 as you have precision for and just bite the bullet: round the scaled values to the nearest integers and lose some accuracy. That’s not necessarily as bad as it sounds. The original design had 31-digit precision in base 10, which corresponds to roughly 103 bits, well beyond the already quite large 64-bit integers on most modern machines. So you can afford to scale the values up quite a bit before rounding. We’ll see an example in a bit of how long it takes for errors to accumulate.

Negative values

To represent negative values we use the exact same trick as with binary numbers, just in base 10: ten’s complement. A negative number d is represented as 10^n − d where n is the number of decimal digits, so -1 is represented by 999…999 and so forth. The adding mechanism itself has no concept of negative values; just like with two’s complement, the correct behavior simply falls out of how overflow works. It’s up to the operator or printer to interpret the output correctly as signed or unsigned values.

Here is an example of a function that starts off positive, peaks, and then descends into the negative numbers.

You’ll notice that the numbers in the columns are all positive but the ones that represent negative values are printed as negative. As with the decimal point that’s a hack I added in the printer which makes it print smaller integer values as positive and larger ones as negative. But it’s purely a matter of interpretation, the calculating part of the engine is oblivious.

Polynomial approximation

Okay, now we’re getting close to the goal: being able to produce accurate arithmetic tables of general functions.

In my previous post the main example of how important mathematical tables were was how astronomical tables were used by mariners to navigate by lunar distance. I don’t actually know the math underlying those astronomical tables so here I’ll use an example I do understand: trigonometric functions. On the right here is a table of trigonometric functions from 1785. It gives 7 digits of the values of 8 different trigonometric functions, sin, cos, tan, etc., for each arcminute between 0° and 45°. There are 60 arcminutes to one degree so that’s 2700 values for each function, 21,600 values in total. The author of this table, Charles Hutton, said about this edition in a later one that it had so many errors that it was basically useless:

Finding, as well from the report of others, as from my own experience, that those editions […] were so very incorrectly printed, the errors being multiplied beyond all tolerable bounds, and no dependence to be placed on them for any thing of real practice […]

For this last part about how to calculate complex functions I’ll show how to replicate one column, sin, of this table.

Since we know how to evaluate polynomials the solution that first springs to mind is to approximate the function we want to calculate using a polynomial. Taylor polynomials were well known at this time so that’s an obvious approach. Taylor’s theorem says that for an infinitely differentiable function f (which sin is),

f(x) = \sum_{k=0}^{\infty}\frac{f^{(k)}(a)}{k!}(x-a)^k

where f(k) means f differentiated k times and a is any point on the function. Since the engine only has 8 columns, and a polynomial of degree n needs n+1 columns, we have to limit ourselves to a polynomial of degree at most 7. And in fact, since I want to demonstrate this with the emulator in a little bit, and 8 columns of 31 digits take up an enormous amount of space, we’ll limit ourselves even further in this example to 4 columns of 13 digits. This means we’ll use the Taylor polynomial only up to the third-degree term. For sin that gives:

\sin(x) \approx x - \frac{x^3}{3!}

(Conveniently all the even degree terms are 0). This approximates sin quite well around 0 so we’ll use that as the basis for generating the table.

To calculate the differences we first need to produce n+1 values at fixed intervals. The generated table should have an entry per arcminute so we’ll start at 0′ and do steps of 1′:

x    sin(x)
0′   0
1′   2.909 × 10^-4
2′   5.818 × 10^-4
3′   8.727 × 10^-4

Then we need to find the first through third differences:

x    sin(x)          Δ1              Δ2                Δ3
0′   0               2.909 × 10^-4   -2.461 × 10^-11   -2.461 × 10^-11
1′   2.909 × 10^-4   2.909 × 10^-4   -4.923 × 10^-11
2′   5.818 × 10^-4   2.909 × 10^-4
3′   8.727 × 10^-4

All of this is a matter of evaluating a polynomial so that’s not super hard to do by hand with as many decimals as you need, as long as you only need a few of them. From this table we take the last of each of the differences and that’ll be the starting point for the calculation:


At this point we need to decide how much accuracy we want, that is, where we want the fixed decimal point to be. We have 13 digits, which gives us room enough to multiply by 10^13 before rounding. That gives us these integer values:


And now we’re ready to get tabulating:

If you follow along with the original table you can see that it generates exactly the same values. The values generated by the engine continue to match the table until 1° 1′, 57 values in, where the table says 0.0177432 and the engine produces 0.0177433; after that single mismatch it goes back to matching the table up until 1° 53′, more than 100 values in.

Not bad right? And remember, this is a simplified emulator that can only calculate the third degree approximation where the original could go up to the seventh, and only with 13 digits of precision where the original had 31.

So what’s the source of the deviation? There are two: the approximating polynomial and the accumulating error of the engine. Let’s look at the polynomial first.

The first plot on the right shows how quickly the approximating polynomials of different degrees deviate from the true sine function. At 1° the approximations are still well within 0.01 billionth of the correct value. Like I said, near 0 these polynomials approximate sin really well.

This suggests that the main source of inaccuracy is the engine itself, the precision we had to discard when fixing the decimal point, and as you can see in the second graph, it is. The engine loses precision faster than the polynomial by a large factor. This makes sense because the inaccuracy accumulates for every step.

Luckily in this case the polynomial deviates slowly enough that we can use it to calculate new, almost-accurate initial values at fixed intervals, for instance for each degree, and reset the machine to those. However, eventually the polynomial itself will deviate too much, and at that point we can use the fact that the Taylor polynomial has an a parameter that specifies the point around which we’re approximating. So if the polynomial that approximates around 0° becomes too inaccurate at 6°, we can derive a Taylor polynomial around 6° and use that to continue the calculation. Indeed, since the polynomial approximates equally well on both sides of the point we might as well approximate around 9° and use it for all values between 6° and 12°.

Sin is a relatively easy function to approximate in this way, a function such as log is harder but the same basic principles apply to harder functions. It’s a matter of how often you need to reset to get rid of the accumulating errors and how long the same approximating polynomial remains accurate.

One of the weak points of the engine is that even though it requires less manual work than producing a table completely manually, there’s still a fair amount of manual analysis and computation to be done. That’s not a killer in itself though. Even if it took just as much work to operate, which it surely wouldn’t have, just having a different way to create these tables would have been immensely valuable since two different approaches are likely to produce different errors and hence can be used to cross-check each other. But as this illustrates, even if the engine had been built it was definitely not a matter of just plugging in a few values and then going to lunch, using it involved a fair amount of work [4].

The revolution that didn’t happen

Here’s a video of the only difference engine ever built, completed in 2002.

Babbage tried to build it and ultimately gave up for a number of reasons including family problems, financial problems, problems working with his engineer, and his gradual change of focus to the analytical engine. Despite what you often hear the technology did exist to build it; the modern one was built from his designs and only with technology that would have been available back then.

It also appears that Babbage’s interests and those of the English government, which paid for the whole thing, were just too far apart. Babbage was interested in the machine itself whereas his sponsors just wanted accurate tables, whatever way Babbage could produce them. It’s a shame, really. It seems from what I’ve read that the difference engine was a failure not of vision or technology but of product management. The technology was so promising that if a successful prototype had been built and he’d delivered the tables the English government wanted, it’s not unlikely that they would have continued to fund research in more advanced engines. The ground would have been fertile for mechanical calculation on a larger scale by the mid-1800s. Obviously that wouldn’t have meant a mechanical iPad in every home by 1900, but it would certainly have been a better outcome than what actually happened: the designs went in the drawer.

Ultimately Babbage moved on to the analytical engine, the first ever programmable computer. My next post will be about that and in particular the first ever software programs which were written for it.


For more information about the difference engine a quick introduction is given in the first part of Menabrea’s Sketch of the Analytical Engine from 1842. A more detailed description, including of the underlying mechanics of the engine, can be found in Lardner’s Babbage’s Calculating Engine from the July 1834 edition of the Edinburgh Review.


1: While the emulator performs the same type of calculations as the difference engine it actually looks nothing like it. I made a point to give the emulator a mechanical feel but it’s inspired more by the Curta calculator, the mechanical calculator I know best, not the difference engine. Note also that I have only tested it on the browsers I have easy access to, Chrome, Firefox, Opera, and Safari on mac. If it doesn’t work on your platform the code lives on github and patches are welcome.

2: Ada Lovelace in her famous notes about it is almost contemptuous of how simple it is compared to the analytical engine:

The Difference Engine can in reality […] do nothing but add; and any other process […] can be performed by it only just to the extent in which it is possible, by judicious mathematical arrangement and artifices, to reduce them to a series of additions.

3: Incidentally, much of the underlying math was developed before Babbage by J. H. Müller.

4: I wonder if it’s possible to automate some of the manual work involved in operating the engine using the engine itself, like calculating initial values.


I have a thing for mechanical calculators and it recently occurred to me that I knew almost nothing about two of the most famous ones: Babbage’s difference engine and analytical engine. This led me to read some of the papers from the mid 1800s that were written about them. This blog post is the first of a few I’m planning to write about that.

The analytical engine usually gets most of the attention but the difference engine is an interesting invention in its own right. Not only did it solve an important problem, it is the only one of the two that was complete enough to actually be built. This post is about what made the difference engine so important that Babbage spent decades trying to build it, and why the British government was willing to pay the bill of over ₤17,000, more than the price of two warships.


Today computation is cheap. Extremely cheap. Imagine the amount of math that goes into just displaying the image on your screen right now: the layouts, colors, and fonts, rendering it all on a physical display, and doing it again and again quickly and smoothly enough that you don’t even notice it’s happening.

Computation is so cheap that it’s easy to forget how expensive it was before electronic calculators. It used to be that if you wanted to add two numbers together you had to actually add those numbers together. Manually. Need to multiply or divide two numbers, even just a few digits? Then you’ll have to get the paper out and do long multiplication or long division. I just did a long multiplication to make the image on the right here. I got it wrong twice before getting it right and I went from “this’ll be fun, I wonder if I still remember how to do this” to “god this is so tedious” in about 30 seconds.

And those are just the basic building blocks of doing a calculation. Most interesting computations like calculating interest or the position of the moon in six months require you to do these manual computations over and over and over again. Or require operations that you can’t easily calculate by hand, like trigonometric functions.

At this point you might be thinking: who cares where the moon is in six months? It turns out, back in those days a lot of people did. In some cases people’s lives depended on it.

Lunar Navigation

On the right here is a table of distances in degrees on the night sky from the center of the moon to various stars at particular times. The first line gives the distance between the center of the moon and Aldebaran on March 3, 1775 at noon, 3, 6, and 9 o’clock. Multiply that by 365 days, then multiply it by a dozen stars, and that gives you just some of the tables in this book, the first edition of the Nautical Almanac and Astronomical Ephemeris from 1774, published by the Royal Greenwich Observatory. The audience for the almanac was mariners. The first edition of 10,000 copies sold out immediately.

To determine your longitude at sea you need to know the current time at a fixed point. You can think of it sort of like navigating with time zones. If you know it’s 4 o’clock in the afternoon in Greenwich and it’s noon where you are (which you can tell by looking at the sun) then you know you’re in the -4 time zone, the one that goes through eastern Canada, the eastern Caribbean and central South America. This is a rough analogy but that’s the gist of how it works.

Up until around 1850, before accurate clocks were made that could be carried on long voyages, a reliable way to determine the current time was using lunar distance. The moon and stars in the night sky move as a perfectly predictable clockwork. A given configuration occurs only once, and you can calculate in advance precisely what the sky is going to look like at a later time. And more importantly you can go the other way: given the precise configuration of the sky you can calculate exactly what time it is.

Actually you don’t need the full configuration; all you need to know to calculate the time is the distance in degrees from the center of the moon to any star. That’s where the almanac comes in. It precomputes those distances so that all a navigator needs to do is measure the angle (typically using a sextant) and then look the value up in the almanac. Okay that’s actually just the basic principle, there’s a lot more to it in practice: you have to adjust for the distance from the center of the moon to the circumference, for your position on the earth, for atmospheric refraction, etc. Being a navigator takes a lot of skill. How do you make those adjustments by the way? More tables of course.

All this means that having accurate tables is extremely important. An undetected error in the almanac means a navigation error which can mean shipwreck. This is made worse because many of these tables are time dependent: one line in the almanac is useful on one day only. As a navigator you’re basically beta testing the data for every single day because nobody has had any reason to use the data before.

There are many sources of errors in numerical tables. Teams of human computers carried out the manual calculations, a tedious and error-prone process. (Incidentally, it turns out that the better an understanding you have of the calculation you’re carrying out, the more likely you are to make mistakes as a computer.) Often the same value would be calculated by more than one human computer and then compared to catch errors – but checking is an error-prone process in itself, and computers can (and did) copy from each other. Then finally someone has to manually set the values in movable type and print them, also an obvious source of errors.

Enter Charles Babbage.


Babbage was an unorthodox and very gifted mathematician. He was a fan of Leibniz, which was still something of a heresy at his college, Trinity, home of Newton, Leibniz’s arch-rival. He was also one of the founders of the Analytical Society, whose goal it was to replace Newton’s formalism for calculus with Leibniz’s. Incidentally, besides inventing calculus independently from Newton, Leibniz designed a mechanical calculating machine, the stepped reckoner.

Babbage recognized the problem of calculating tables, as most people did, but also had a solution: the difference engine. The idea behind the difference engine is that most of the functions you want to create tables for can be approximated by a polynomial. Here is the sine function along with three approximating polynomials of increasing degree:

As the degree of the polynomial increases the approximation quickly becomes better – the degree-seven polynomial is quite close:

f_7(x) = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!}

Babbage’s idea was to use mechanical means to calculate the approximating polynomials with high accuracy, and not just print the result on paper but do the actual typesetting, to eliminate even the typographer as a source of errors.

But I’ll stop here before we get to the juicy details of how the difference engine works and save that for my next blog post.

0x5f3759df (appendix)

This post is an appendix to my previous post about 0x5f3759df. It probably won’t make sense if you haven’t read that post (and if you’ve already had enough of  0x5f3759df you may consider skipping this one altogether).

One point that bugged me about the derivation in the previous post was that while it explained the underlying mathematics it wasn’t very intuitive. I was basically saying “it works because of, umm, math”. So in this post I’ll go over the most abstract part again (which was actually the core of the argument). This time I’ll explain the intuition behind what’s going on. And there’ll be some more graphs. Lots of them.

The thing is, there is a perfectly intuitive explanation for why this works. The key is looking more closely at the float-to-int bit-cast operation. Let’s take that function,

int float_to_int(float f) {
  return *(int*)&f;
}

and draw it as a graph:

This is a graph of the floating-point numbers from around 1 to 10 bit-cast to integers. These are very large integers (L is 2^23 so 128L is 1 073 741 824). The function has visible breaks at powers of two and is a straight line in between. This makes sense intuitively if you look at how you decode a floating-point value:

v = (1+m)2^e

The mantissa increases linearly from 0 to 1; those are the straight lines. Once the mantissa reaches 1 the exponent is bumped by 1 and the mantissa starts over from 0. Now it’ll be counting twice as fast since the increased exponent gives you another factor of 2 multiplied onto the value. So after a break the straight line covers twice as many numbers on the x axis as before.

Doubling the distance covered on the x axis every time you move a fixed amount on the y axis – that sounds an awful lot like log2. And if you plot float_to_int and log2 (scaled appropriately) together it’s clear that one is an approximation of the other.

In other words we have that

\mathtt{float\_to\_int}(v) \approx (\log_2(v)+B)L

This approximation is just a slightly different way to view the linear approximation we used in the previous post. You’ll notice that the log2 graph overshoots float_to_int in most places, they only touch at powers of 2. For better accuracy we’ll want to move it down a tiny bit – you’ll already be familiar with this adjustment: it’s σ. But I’ll leave σ out of the following since it’s not strictly required, it only improves accuracy.

You can infer the approximation formula from the above formula alone but I’ll do it slightly differently to make it easier to illustrate what’s going on.

The blue line in the graph above, float_to_int(v), is the i in the original code and the input to our calculation. We want to do integer arithmetic on that value such that we get an output that is approximately

\mathtt{float\_to\_int}(v^{-\frac 12})

the integer value that, if converted back to a float, gives us roughly the inverse square root of x. First, let’s plot the input and the desired output in a graph together:

The desired output is decreasing because it’s a reciprocal so the larger the input gets the smaller the output. They intersect at v=1 because the inverse square root of 1 is 1.

The first thing we’ll do to the input is to multiply it by -1 to make it decrease too:

Where before we had a very large positive value, we now have a very large negative one. So even though they look like they intersect they’re actually very far apart and on different axes: the left one for the desired output, the right one for our work-in-progress approximation.

The new value is decreasing but it’s doing it at a faster rate than the desired output. Twice as fast in fact. To match the rates we multiply it by 1/2:

Now the two graphs are decreasing at the same rate but they’re still very far apart: the target is still a very large positive value and the current approximation is a negative value half as large. So as the last step we’ll add 1.5LB to the approximation: 0.5LB to cancel out the -1/2 we multiplied it by a moment ago, and 1LB to bring it up to the level of the target:

And there it is: a bit of simple arithmetic later and we now have a value that matches the desired output:

\mathtt{float\_to\_int}(v^{-\frac 12}) \approx \frac 32LB - \frac 12 \mathtt{float\_to\_int}(v)

The concrete value of the constant is

\frac 32LB = \mathtt{0x5f400000}

Again the approximation overshoots the target slightly the same way the log2 approximation did before. To improve accuracy you can subtract a small σ from B. That will give a slightly lower constant – for instance 0x5f3759df.

And with that I think it’s time to move onto something else. I promise that’s the last you’ll hear from me on the subject of that constant.


This post is about the magic constant 0x5f3759df and an extremely neat hack, fast inverse square root, which is where the constant comes from.

Meet the inverse square root hack:

float FastInvSqrt(float x) {
  float xhalf = 0.5f * x;
  int i = *(int*)&x;          // evil floating point bit level hacking
  i = 0x5f3759df - (i >> 1);  // what the fuck?
  x = *(float*)&i;
  x = x*(1.5f-(xhalf*x*x));
  return x;
}

What this code does is calculate, quickly, a good approximation for

\frac{1}{\sqrt{x}}
It’s a fairly well-known function these days and first became so when it appeared in the source of Quake III Arena in 2005. It was originally attributed to John Carmack but turned out to have a long history before Quake going back through SGI and 3dfx to Ardent Computer in the mid 80s to the original author Greg Walsh. The concrete code above is an adapted version of the Quake code (that’s where the comments are from).

This post has a bit of fun with this hack. It describes how it works, how to generalize it to any power between -1 and 1, and sheds some new light on the math involved.

(It does contain a fair bit of math. You can think of the equations as notes – you don’t have to read them to get the gist of the post but you should if you want the full story and/or verify for yourself that what I’m saying is correct).


Why do you need to calculate the inverse of the square root – and need it so much that it’s worth implementing a crazy hack to make it fast? Because it’s part of a calculation you do all the time in 3D programming. In 3D graphics you use surface normals, 3-coordinate vectors of length 1, to express lighting and reflection. You use a lot of surface normals. And calculating them involves normalizing a lot of vectors. How do you normalize a vector? You find the length of the vector and then divide each of the coordinates by it. That is, you multiply each coordinate by

\frac{1}{\sqrt{x^2+y^2+z^2}}
Calculating x^2+y^2+z^2 is relatively cheap. Finding the square root and dividing by it is expensive. Enter FastInvSqrt.
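To see where it fits, here is a sketch of a normalization routine built on top of the hack (Normalize3 is my illustrative name, not something from the Quake source):

```c
/* The hack from above, reproduced for context. */
float FastInvSqrt(float x) {
  float xhalf = 0.5f * x;
  int i = *(int*)&x;          /* evil floating point bit level hacking */
  i = 0x5f3759df - (i >> 1);  /* what the fuck? */
  x = *(float*)&i;
  x = x*(1.5f-(xhalf*x*x));
  return x;
}

/* Normalize a 3-vector in place: multiply each coordinate by the
   (approximate) inverse length instead of dividing by the length. */
void Normalize3(float v[3]) {
  float invLen = FastInvSqrt(v[0]*v[0] + v[1]*v[1] + v[2]*v[2]);
  v[0] *= invLen;
  v[1] *= invLen;
  v[2] *= invLen;
}
```

The point of the trick is visible here: one fast approximation and three multiplications replace a square root and three divisions.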

What does it do?

What does the function actually do to calculate its result? It has 4 main steps. First it reinterprets the bits of the floating-point input as an integer.

int i = *(int*)&x;         // evil floating point bit level hack

It takes the resulting value and does integer arithmetic on it which produces an approximation of the value we’re looking for:

i = 0x5f3759df - (i >> 1);  // what the fuck?

The result is not the approximation itself though, it is an integer which happens to be, if you reinterpret the bits as a floating point number, the approximation. So the code does the reverse of the conversion in step 1 to get back to floating point:

x = *(float*)&i;

And finally it runs a single iteration of Newton’s method to improve the approximation.

x = x*(1.5f-(xhalf*x*x));

This gives you an excellent approximation of the inverse square root of x. The last part, running Newton’s method, is relatively straightforward so I won’t spend more time on it. The key step is step 2: doing arithmetic on the raw floating-point number cast to an integer and getting a meaningful result back. That’s the part I’ll focus on.
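For the record, the Newton step is easy to verify: apply Newton’s method to f(y) = y^{-2} - x, whose positive root is exactly the inverse square root, and you get

y_{n+1} = y_n - \frac{f(y_n)}{f'(y_n)} = y_n - \frac{y_n^{-2} - x}{-2 y_n^{-3}} = y_n({\frac 32} - {\frac x2}y_n^2)

which is precisely x*(1.5f-(xhalf*x*x)), with xhalf holding half the original input.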

What the fuck?

This section explains the math behind step 2. (The first part of the derivation below, up to the point of calculating the value of the constant, appears to have first been found by McEniry).

Before we can get to the juicy part I’ll just quickly run over how standard floating-point numbers are encoded. I’ll just go through the parts I need, for the full background wikipedia is your friend. A floating-point number has three parts: the sign, the exponent, and the mantissa. Here’s the bits of a single-precision (32-bit) one:

s e e e e e e e e m m m m m m m m m m m m m m m m m m m m m m m

The sign is the top bit, the exponent is the next 8 and the mantissa the bottom 23. Since we’re going to be calculating the square root, which is only defined for positive values, I’m going to be assuming the sign is 0 from now on.

When viewing a floating-point number as just a bunch of bits the exponent and mantissa are just plain positive integers, nothing special about them. Let’s call them E and M (since we’ll be using them a lot). On the other hand, when we interpret the bits as a floating-point value we’ll view the mantissa as a value between 0 and 1, so all 0s means 0 and all 1s is a value very close to but slightly less than 1. And rather than use the exponent as an 8-bit unsigned integer we’ll subtract a bias, B, to make it a signed integer between -127 and 128. Let’s call the floating-point interpretation of those values e and m. In general I’ll follow McEniry and use upper-case letters for values that relate to the integer view and lower-case for values that relate to the floating-point view.

Converting between the two views is straightforward:

m = \frac{M}{L}

e = E - B

For 32-bit floats L is 2^{23} and B is 127. Given the values of e and m you calculate the floating-point number’s value like this:

(1 + m)2^e
and the value of the corresponding integer interpretation of the number is

M + LE
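These two views are easy to check against each other in code. Here’s a quick sketch (CheckIntegerView is my name; it uses memcpy for the bit-cast, which does the same job as the pointer cast in the hack):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Split the integer view of a positive float into the E and M fields
   and check that the integer view really is M + L*E. */
void CheckIntegerView(float x) {
  uint32_t i;
  memcpy(&i, &x, sizeof i);    /* bit-cast, same idea as *(int*)&x */
  uint32_t M = i & 0x7fffffu;  /* bottom 23 bits: the mantissa */
  uint32_t E = i >> 23;        /* next 8 bits: the exponent (sign is 0) */
  uint32_t L = 1u << 23;
  assert(i == M + L * E);
}
```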

Now we have almost all the bits and pieces we need to explain the hack so I’ll get started and we’ll pick up the last few pieces along the way. The value we want to calculate, given some input x, is the inverse square root or

y = \frac{1}{\sqrt{x}} = x^{-\frac 12}

For reasons that will soon become clear we’ll start off by taking the base 2 logarithm on both sides:

\log_2 y = {-\frac 12}\log_2 x

Since the values we’re working with are actually floating-point we can replace x and y with their floating-point components:

\log_2 (1+m_y) + e_y = {-\frac 12}(\log_2 (1+m_x) + e_x)

Ugh, logarithms. They’re such a hassle. Luckily we can get rid of them quite easily but first we’ll have to take a short break and talk about how they work.

On both sides of this equation we have a term that looks like this,

\log_2(1 + v)

where v is between 0 and 1. It just so happens that for v between 0 and 1, this function is pretty close to a straight line:

log2(1 + v) vs. v + σ

Or, in equation form:

\log_2(1 + v) \approx v + \sigma

Where σ is a constant we choose. It’s not a perfect match but we can adjust σ to make it pretty close. Using this we can turn the exact equation above that involved logarithms into an approximate one that is linear, which is much easier to work with:

m_y + \sigma + e_y \approx {-\frac 12}(m_x + \sigma + e_x)

Now we’re getting somewhere! At this point it’s convenient to stop working with the floating-point representation and use the definitions above to substitute the integer view of the exponent and mantissa:

\frac{M_y}{L} + \sigma + E_y - B \approx {-\frac 12}(\frac{M_x}{L} + \sigma + E_x - B)

If we shuffle these terms around a few steps we’ll get something that looks very familiar (the details are tedious, feel free to skip):

\frac{M_y}{L} + E_y \approx {-\frac 12}(\frac{M_x}{L} + \sigma + E_x - B) - \sigma + B

\frac{M_y}{L} + E_y \approx {-\frac 12}(\frac{M_x}{L} + E_x) - \frac{3}{2}(\sigma - B)

M_y + LE_y \approx {\frac 32}L(B - \sigma) - {\frac 12}(M_x + LE_x)

After this last step something interesting has happened: among the clutter we now have the value of the integer representations on either side of the equation:

\mathbf{I_y} \approx {\frac 32}L(B - \sigma) - {\frac 12}\mathbf{I_x}

In other words the integer representation of y is some constant minus half the integer representation of x. Or, in C code:

i = K - (i >> 1);

for some K. Looks very familiar right?

Now what remains is to find the constant. We already know what B and L are but we don’t have σ yet. Remember, σ is the adjustment we used to get the best approximation of the logarithm, so we have some freedom in picking it. I’ll pick the one that was used to produce the original implementation, 0.0450465. Using this value you get:

{\frac 32}L(B - \sigma) = {\frac 32}2^{23}(127 - 0.0450465) = 1597463007

Want to guess what the hex representation of that value is? 0x5f3759df. (As it should be of course, since I picked σ to get that value.) So the constant is not a bit pattern as you might think from the fact that it’s written in hex, it’s the result of a normal calculation rounded to an integer.
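You can check that arithmetic directly; this sketch just recomputes the expression from B, L and σ:

```c
/* Recompute (3/2) * L * (B - sigma), the expression derived above. */
long MagicConstant(void) {
  double L = 8388608.0;      /* 2^23 */
  double B = 127.0;          /* the exponent bias */
  double sigma = 0.0450465;  /* the chosen logarithm adjustment */
  return (long)(1.5 * L * (B - sigma));
}
```

Truncating the result to an integer gives back 0x5f3759df.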

But as Knuth would say: so far we’ve only proven that this should work, we haven’t tested it. To give a sense for how accurate the approximation is here is a plot of it along with the accurate inverse square root:

Graph of approximation vs. accurate value

This is for values between 1 and 100. It’s pretty spot on right? And it should be – it’s not just magic, as the derivation above shows, it’s a computation that just happens to use the somewhat exotic but completely well-defined and meaningful operation of bit-casting between float and integer.

But wait there’s more!

Looking at the derivation of this operation tells you something more than just the value of the constant though. You will notice that the derivation hardly depends on the concrete value of any of the terms – they’re just constants that get shuffled around. This means that if we change them the derivation still holds.

First off, the calculation doesn’t care what L and B are. They’re given by the floating-point representation. This means that we can do the same trick for 64- and 128-bit floating-point numbers if we want, all we have to do is recalculate the constant which is the only part that depends on them.

Secondly it doesn’t care which value we pick for σ. The σ that minimizes the difference between the logarithm and v+σ may not, and indeed does not, give the most accurate approximation. That’s due to a combination of floating-point rounding and the effect of the Newton step. Picking σ is an interesting subject in itself and is covered by McEniry and Lomont.

Finally, it doesn’t depend on -1/2. That is, the exponent here happens to be -1/2 but the derivation works just as well for any other exponent between -1 and 1. If we call the exponent p (because e is taken) and do the same derivation with that instead of -1/2 we get:

\mathbf{I_y} \approx (1 - p)L(B - \sigma) + p\mathbf{I_x}

Let’s try a few exponents. First off p=0.5, the normal non-inverse square root:

\mathbf{I_y} \approx K_{\frac 12} + {\frac 12}\mathbf{I_x}

K_{\frac 12} = {\frac 12}L(B - \sigma) = {\frac 12}2^{23}(127 - 0.0450465) = \mathtt{0x1fbd1df5}

or in code form,

i = 0x1fbd1df5 + (i >> 1);

Does this work too? Sure does:

Graph of approximation vs. accurate

This may be a well-known method to approximate the square root but a cursory google and wikipedia search didn’t suggest that it was.
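In code, the square-root version might look like this (FastSqrt is my name for it; note there is no Newton step here, so you only get the raw approximation, good to a couple of percent):

```c
/* Rough square root using the same bit trick, no Newton refinement. */
float FastSqrt(float x) {
  int i = *(int*)&x;
  i = 0x1fbd1df5 + (i >> 1);
  return *(float*)&i;
}
```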

It even works with “odd” powers, like the cube root

\mathbf{I_y} \approx K_{\frac 13} + {\frac 13}\mathbf{I_x}

K_{\frac 13} = {\frac 23}L(B - \sigma) = {\frac 23}2^{23}(127 - 0.0450465) = \mathtt{0x2a517d3c}

which corresponds to:

i = (int) (0x2a517d3c + (0.333f * i));

Since this is an odd factor we can’t use shift instead of multiplication. Again the approximation is very close:

Graph of approximation vs. accurate

At this point you may have noticed that when changing the exponent we’re actually doing something pretty simple: just adjusting the constant by a linear factor and changing the factor that is multiplied onto the integer representation of the input. These are not expensive operations so it’s feasible to do them at runtime rather than pre-compute them. If we pre-multiply just the two other factors:

L(B - \sigma) = 2^{23}(127 - 0.0450465) = \mathtt{0x3f7a3bea}

we can calculate the value without knowing the exponent in advance:

i = (1 - p) * 0x3f7a3bea + (p * i);

If you shuffle the terms around a bit you can even save one of the multiplications:

i = 0x3f7a3bea + p * (i - 0x3f7a3bea);

This gives you the “magic” part of fast exponentiation for any exponent between -1 and 1. The one piece we still need for a fast exponentiation function that works for all exponents, and is as accurate as the original inverse square root function, is a generalization of the Newton approximation step. I haven’t looked into that so that’s for another blog post (most likely by someone other than me).
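Put together, the runtime-adjustable version might look like this (FastPow is my name for it; without a Newton refinement step the accuracy is only a few percent):

```c
/* Rough approximation of x^p for exponents between -1 and 1,
   using the runtime-adjusted constant; no Newton refinement. */
float FastPow(float x, float p) {
  int i = *(int*)&x;
  i = (int)(0x3f7a3bea + p * (i - 0x3f7a3bea));
  return *(float*)&i;
}
```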

The expression above contains a new “magical” constant, 0x3f7a3bea. But even if it’s in some sense “more magical” than the original constant, since that one can be derived from it, it depends on an arbitrary choice of σ so it’s not universal in any way. I’ll call it Cσ and we’ll take a closer look at it in a second.

But first, one sanity check we can try with this formula is when p=0. For a p of zero the result should always be 1 since x^0 is 1 independent of x. And indeed the second term falls away because it is multiplied by 0 and so we get simply:

i = 0x3f7a3bea;

Which is indeed constant – and interpreted as a floating-point value it’s 0.977477 also known as “almost 1” so the sanity check checks out. That tells us something else too: Cσ actually has a meaningful value when cast to a float. It’s 1; or very close to it.

That’s interesting. Let’s take a closer look. The integer representation of Cσ is

C_\sigma = L(B - \sigma) = LB - L\sigma

This is almost but not quite the shape of a floating-point number, the only problem is that we’re subtracting rather than adding the second term. That’s easy to fix though:

LB - L\sigma = LB - L + L - L\sigma = L(B - 1) + L(1 - \sigma)

Now it looks exactly like the integer representation of a floating-point number. To see which we’ll first determine the exponent and mantissa and then calculate the value, cσ. This is the exponent:

e_{c_\sigma} = (E_{C_\sigma} - B) = (B - 1 - B) = -1

and this is the mantissa:

m_{c_\sigma} = \frac{M_{C_\sigma}}{L} = \frac{L(1 - \sigma)}{L} = 1 - \sigma

So the floating-point value of the constant is (drumroll):

c_\sigma = (1 + m_{c_\sigma})2^{e_{c_\sigma}} = \frac{1 + 1 - \sigma}2 = 1 - \frac{\sigma}2

And indeed if you divide our original σ from earlier, 0.0450465, by 2 you get 0.02252325; subtract it from 1 you get 0.97747675 or our friend “almost 1” from a moment ago. That gives us a second way to view Cσ, as the integer representation of a floating-point number, and to calculate it in code:

float sigma = 0.0450465;
float c_sigma = 1 - (0.5f * sigma);
int C_sigma = *(int*)&c_sigma;

Note that for a fixed σ these are all constants and the compiler should be able to optimize this whole computation away. The result is 0x3f7a3beb – not exactly 0x3f7a3bea from before but just one bit away (the least significant one) which is to be expected for computations that involve floating-point numbers. Getting to the original constant, the title of this post, is a matter of multiplying the result by 1.5.

With that we’ve gotten close enough to the bottom to satisfy at least me that there is nothing magical going on here. For me the main lesson from this exercise is that bit-casting between integers and floats is not just a meaningless operation, it’s an exotic but very cheap numeric operation that can be useful in computations. And I expect there’s more uses of it out there waiting to be discovered.

(Update: I’ve added an appendix about the intuition behind the part where I shuffle around the floating-point terms and the integer terms appear.)


This post is about one of those messages you see every day of your life as a programmer, your editor telling you: this file has changed, would you like to reload or keep your changes? The problem with this particular message is that usually neither option is what you want. You’ve made changes to that file for a reason and you neither want to lose those changes nor go out of sync with the file system. There’s a lot of new editors about these days (which is fantastic) but I’ve yet to see one make an attempt to solve this problem. This blog post gives some thoughts on how it could be solved.

The problem

The underlying issue here is that editors assume they have exclusive access to all files you are editing. Some editors, including Eclipse and Emacs, consider you to be editing a file as soon as you’ve opened it, before you’ve even made any changes. And once you have unsaved changes I don’t know of any editors that are forgiving if the file changes on disk in the meantime.

This assumption, that an editor can have exclusive access to a file, doesn’t seem sound. Just because you have a file open in an editor doesn’t mean it can’t change. If you’re working on a busy code base there could be people editing the same files as you at any time. Usually this is controlled through version control so you have control over when you sync and can make sure you don’t have unsaved changes when you do, but this is something you have to manage yourself. You get no help from your editor. Worse is if you’re working on something time critical, like configuration files that control a live application. These kinds of files can be very busy and you’ll want to both stay in sync and be able to queue up and submit changes quickly. In both cases, but especially the latter, the revert/override choice is really unhelpful.

And these problems arise in normal old-fashioned use of source code. Besides the emergence of new editors people are also busy working on web-based IDEs. This further undermines the assumption that an editor can own a file. As soon as you’re on the web it becomes much easier to share and collaborate. It may be desirable to support more fine-grained real-time collaboration in addition to the more coarse-grained version control model we’re used to, where modifications come in larger self-contained changesets. Real-time collaboration is not appropriate in many cases, but there are cases where it makes sense and it definitely shouldn’t be the capabilities of the editor that decide whether it’s feasible. If you want to use it the editor should support it, on the desktop or on the web.

Living source files

The good news is that this issue has long been solved in the context of collaborative online tools like google docs and google wave (I know there are loads of others but these are the ones I know best). These kinds of tools can’t assume that you have exclusive access to the document you’re editing, there can be any number of entities modifying a document at the same time. They also can’t assume that what you’re seeing is the current contents of the document. In fact in this setting the concept of an authoritative “current document” isn’t even well defined – as long as each participant keeps editing and their changes take a while to propagate the participants may all see slightly different versions of the document, none of which is more authoritative than the others. Only once everyone has stopped editing and everyone has caught up to each other’s changes does the document stabilize and you can talk about authoritative current contents of the file.

This model may sound exotic and not appropriate for editing local files, and it may not be. But the underlying assumptions of those models, that what you’re seeing may not reflect the current contents of the file and that those contents may change at any time, are good assumptions to make even for local file editors. The next part is about how to make an editor work under those assumptions.


How do online collaborative tools work then? The ones I’m familiar with use operational transformation, or OT. The way OT works is by changing the focus from the question what are the contents of this file? to what operations have been performed on the contents of this file? This slight change in viewpoint makes a huge difference.

In a (very simplified) non-OT model your interaction with a source file in an editor works something like this:

  • When a file is opened, read the current contents into a buffer.
  • Any changes the user makes are applied directly (destructively) to the buffer.
  • When the user asks for the file to be saved the current contents of the buffer are written directly into the file. If the file has changed those changes would be lost so throw up a revert/override dialog and let the user choose.

In the OT model the interaction would work something along these lines. (This is a very rough outline, a proper OT model would be much cleaner than this makes it sound):

  • When a file is opened, read the current contents into a buffer.
  • Any changes the user makes are stored as operations, essentially a diff against the original file contents.
  • When the user asks the editor to save the file it does the following:
    • Read the current contents of the file.
    • Generate a three-way diff between the original contents, the current contents, and the pending operations.
    • Apply the resulting diff to the current file contents.
    • Save the results.

While this isn’t completely accurate it gives a sense of the basics: what you’re doing is not editing a file, though it looks just like it, you’re building a set of changes against the file contents from one point in time. The changes you build may be applied to the original file contents, or they may be applied to a different version if the file changes. This model is much more robust to the way people actually work with source code. And though you may not always want your editor to merge changes on the fly when you save wouldn’t it be nice if it would at least give you the option?
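To make the operations-as-data idea concrete, here’s a toy sketch (nowhere near a real OT implementation – just the minimal shift from mutating a buffer to recording an edit that can be applied to whatever contents happen to be current):

```c
#include <stdlib.h>
#include <string.h>

/* A single recorded edit: at byte `pos`, delete `del` bytes, insert `text`. */
typedef struct { size_t pos; size_t del; const char *text; } Op;

/* Apply one op to some contents, returning freshly allocated output.
   The op doesn't care whether `base` is the original file contents or a
   newer version, as long as the position still makes sense. */
char *ApplyOp(const char *base, Op op) {
  size_t n = strlen(base), ins = strlen(op.text);
  char *out = malloc(n - op.del + ins + 1);
  memcpy(out, base, op.pos);                          /* prefix */
  memcpy(out + op.pos, op.text, ins);                 /* inserted text */
  strcpy(out + op.pos + ins, base + op.pos + op.del); /* suffix */
  return out;
}
```

A real system would also need to transform the positions in pending ops against changes made by others – that transformation is the hard (and interesting) part of OT.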

Beyond just saving this model has several other nice properties. When the system notices that a file has changed, even before you want to save it, it can update your buffer by doing essentially the same as when saving except that instead of applying your pending changes to the current file contents it would apply the changes between the old and current file contents to the buffer containing your pending changes. It also allows you to undo saving since saving is equivalent to applying a diff and you can undo that, even when there’s been subsequent changes, just by reverting the diff.

There are probably numerous other solutions to this issue. I’m most familiar with OT because I worked with it when I was on the wave project and I know it works extremely well. (I wish I had a good place to point anyone interested in learning the basics but I don’t, only the wikipedia page listed above). My main point is that this issue should be solved somehow, and it’s a problem that’s been solved in other contexts. I think it’s about time those solutions make their way to desktop editors.


Came across an article about information overload on google+, which was very apropos of my recent efforts to not be so stressed at work. I’ve made some changes to how I work over the last few months, some of them just tweaks actually, and I feel like they’ve improved my ability to focus and made work much less stressful. YMMV, obviously.

  • I use virtual desktops to minimize distraction. I have three virtual desktops: one desktop with just what I’m working on right now, nothing else, one with email, code reviews, and other things that might distract me, and one with personal stuff like personal email. I can easily switch desktops and check emails, but the only thing that can impose on me are chat popups, and people don’t chat me a lot. I occasionally check for new important email or code reviews but I can choose when, typically while I’m waiting for something on my work desktop to complete.
  • I keep the tasks I’m working on written down and categorized according to when I want to complete them. I pick one task from the stuff-I-should-do-soon pile and work just on that. When that pile is empty, or it’s been a while since the last time, I look over all my tasks, usually a quarter’s worth, and move some out of and some into the soon-pile. This means I don’t have to worry about all the things I’m not working on right now because I know I’ve considered what I could be working on and chosen what I’m doing right now as the most important. This has been very helpful in making me less stressed. This is apparently a well-known approach, I first saw it in this video about how to use Asana (Using Asana for Individual Task Management).
  • I make an effort to only work 8 hours a day. This is a tough one because it sometimes makes me feel like I’m slacking off, especially because I come in early so I’m almost always the first to go home. However, there’s plenty of research suggesting that even though you may feel you’re getting more work done working 10 hours a day it’s an illusion and you’re actually doing less than if you worked 8 hours (see why crunch mode doesn’t work). At first I just trusted the research, against my own intuition, but at this point I’m convinced — I feel much more focused and efficient and I see no evidence that I’m being less productive. I will occasionally work more for a while but I always feel an “overwork hangover” afterwards where I’m tired, less focused, more likely to space out. But at this point I know that and can make a conscious choice about whether it’s worth it.

Some of these require a bit of motivation; the one really easy thing is to clear any distractions from your workspace, especially if like me you were silly enough to have a browser window there. It’s easy and it makes a big difference.


One really useful programming tool I use whenever it’s available is exported stats monitoring. It’s one of those things that has no standard definition and which people reimplement over and over in different ad-hoc ways. This post is about a monitoring tool for JavaScript programs running in chrome I implemented recently, WebMon.


WebMon is a very simple tool conceptually; this image tells you all you need to know to use it:


It has two parts: a JavaScript library, webmon.js, and a chrome extension, WebMon. In your JavaScript code you export a stat by creating an instance of the webmon.Counter type which is provided by the library. During execution you update the stats as appropriate using the methods on the Counter instances. The WebMon chrome extension detects any page that uses the library and can show you a popup that shows the current value of each exported stat, updated continuously. You can also specify some simple computations on the data, for instance here the frame rate counter is configured to display the rate per second rather than the count. To see this in action try installing the chrome extension and going to this test page. The extension needs pretty broad permissions to see stats on any web page so for the paranoid you may want to skim the source code before installing it.

The motivation for implementing this came from a WebGL-based hobby project I’m working on. I needed to keep track of the frame rate and rendering CPU load and my initial implementation which displayed those in a div on the page was just not working out. WebMon gives you the same functionality with a minimum of complexity and performance impact on the JavaScript side since almost all the code is in the chrome extension. It also doesn’t require you to clutter your page with debug info.

That’s all there is to it. The rest of this post will be about exported stats in general.


WebMon is just one example of the more general mechanism of exported stats. The general model is: a program exports a set of stats which it updates during execution. These stats are visible to some form of external processor/viewer which can record or at least see their value at any time. A simple example is chrome. If you run chrome with the --enable-stats-table command-line flag you can go to chrome://stats and see the current value of a number of chrome’s internal counters and timers. V8 has a similar mechanism but since v8 doesn’t have a UI that can easily display the stats I wrote a python script that continuously reads the counter values and displays them in a window. I used it all the time when working on the v8/chrome integration layer since it allowed me to see exactly how many dom node wrappers were live, exactly what had happened during the last garbage collection, etc.

Much of this you could also get by printing debug information and processing it either manually or, for more complex output, using separate programs. A place this works really well is dumping debug information after garbage collections in a virtual machine. Garbage collections happen rarely enough that it’s not a lot of overhead and you’re likely to want information about every collection, not just the last one. Compared to logging the advantage of using an exported stat is that updating it is cheaper — in the v8 implementation it’s just writing a single word in memory — and it scales well. If you log information about 100 different conditions your log becomes huge and difficult to process, and the I/O slows your program down. With exported stats the external processor or viewer can easily filter the stats for you and only show the ones you’re interested in. Also, the space for each export stat is constant, typically just a single word containing the stat and another few words of metainformation. Another advantage of exported stats is that they’re testable — you can easily read back and test the value of a counter whereas testing that the right thing is printed in a debug log is tricky.
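As an illustration of how cheap the update path can be, an exported counter can be as little as this (a toy sketch, not v8’s actual implementation):

```c
/* A toy exported counter: a name plus a word of storage. An external
   viewer can scan this table at any time; updating a stat is a single
   word write, which is why it's so much cheaper than logging. */
typedef struct { const char *name; long value; } Counter;

static Counter counters[] = {
  { "gc-collections", 0 },
  { "node-wrappers",  0 },
};

static void CounterInc(Counter *c) { c->value++; }
```

In a real system the table would live in shared memory or be exposed over some protocol so the viewer can be a separate process.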

Really I think each programming language should come with stats exporting and monitoring built in or as a canonical library. Or even better, since monitoring and viewing can (and should) be a separate process there could be a single common monitoring tool with a standard protocol that would have bindings for each language.

Until that happens, if you’d like to know what is happening in your JavaScript program as it happens you should give WebMon a try.


It turns out that Google’s generous supply of free snacks to employees does more than just keep us fat and happy, my efforts to keep my snack intake down led to some interesting math too. In this post I’ll have a bit of fun with Gray code, prove wikipedia wrong (shock!), and get healthier by eating fewer snacks.

It recently occurred to me that my snack counting system was in unary: one piece of sticky tack for one snack. As a self-respecting computer scientist obviously I find this unacceptable – unary is okay for some very limited uses but binary should really always be your first choice. Clearly I had to change to a binary system.

The problem with standard binary numbers though is that I don’t want to have to move too many dots around. It sucks that when I’m having my fourth snack of the week I have to move all 3 dots to go from 011 to 100. Luckily there is a solution to this: Gray code, which allows you to count as efficiently as normal binary but only flips one bit at a time.

To use Gray code though I needed an easy way to count from one number to the next. It’s fine to know that I only need to flip one bit but I also need to know which one to flip. Skimming the wikipedia page I found this recipe for counting:

[…] at step i find the bit position of the least significant ‘1’ in the binary representation of i – flip the bit at that position in the previous code to get the next code.

This is actually incorrect (more on that below) but I didn’t know that when I first read it. I just thought “that sounds like a lot of work, surely it doesn’t have to be that difficult”. And it doesn’t if you look at how Gray code1 is constructed.

Gray code is constructed recursively. Given the sequence of Gray codes that count from 0 to n, for instance here are the values that count from 0 to 3:

00 01 11 10

you construct the set of numbers counting from 0 to 2n+1 by adding a zero in front of the previous sequence, using that to count up to n, and then replacing the 0 with a 1 and counting the previous sequence backwards down to 0:

000 001 011 010 110 111 101 100

This is a really neat construction and makes it easy to prove inductively that they have some of their nice properties: that you always change exactly one bit, that you never hit the same value twice, etc.
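The construction translates directly into code. Here’s a sketch (GraySequence is my name) that builds the n-bit sequence by exactly the prefix-and-reflect step described above:

```c
/* Build the full n-bit Gray sequence into out[0 .. 2^bits - 1] by
   repeatedly mirroring the sequence built so far and setting the new
   top bit on the mirrored half. */
void GraySequence(int bits, unsigned *out) {
  out[0] = 0;
  int len = 1;
  for (int b = 0; b < bits; b++) {
    for (int j = 0; j < len; j++)   /* mirrored copy with the new bit set */
      out[2 * len - 1 - j] = out[j] | (1u << b);
    len *= 2;
  }
}
```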

This construction also means you can increment a Gray coded number recursively, using this procedure:

  1. If the first digit is 0 and the remaining digits can be incremented, increment them.
  2. If the first digit is 0 and the remaining digits are already at their largest possible value, flip the first digit.
  3. If the first digit is 1 decrement the remaining digits.

The procedure for decrementing is the same, just inverted. This may sound complicated but it can be phrased differently in a way that makes it easier to use in practice.

  • You want to flip exactly one bit. Initially you want to flip a bit up, that is, from 0 to 1.
  • You start from the far left, going right, and you must flip the least significant bit that can be flipped the way you want. You can’t flip a 0-bit down since it’s already down, and you can’t flip a 1-bit up.
  • When you meet a 1 you change your mind about which way you want to flip, starting at the next bit. If you wanted to flip up before, going forward you’ll want to flip down and vice versa. But only from the next bit.

I call this the flip-flop2 algorithm because you keep changing your mind about which way you want to flip the bit. Here’s an example of how it works. Say you want to increment this Gray coded number (it happens to be 115 but the concrete value is irrelevant):

01001010

You start out wanting to flip a bit up. The first bit is 0 so you can up it, which makes it a candidate for the bit to flip. We’ll mark it like this:

01001010
↑
g

The arrow goes up because we want to flip the bit up, and it’s marked g for green because it’s possible to flip it. Moving right, we also want to flip the next bit up but it’s already a 1 so we can’t and we mark that one r for red.

01001010
↑↑
gr

Moving right again, since the last bit we saw was a 1 we now change our minds and want to flip a bit down. The next two bits are both 0 so we can’t down those and mark them red too:

01001010
↑↑↓↓
grrr

The next bit is a 1 which we can down,

01001010
↑↑↓↓↓
grrrg

and as usual we change our mind and now want to up. The next bit is a 0 which we can up

01001010
↑↑↓↓↓↑
grrrgg

And so on: the next bit we can’t up and the last bit we can’t down so both become red.

01001010
↑↑↓↓↓↑↑↓
grrrggrr

The least significant bit we could flip the way we’d want is the third so that’s the one we’ll flip. The end result is this:

01001010 → 01001110

In practice it’s easy to do. You just scan the number from left to right, keeping track of operation you currently want to do and what the last position was where you could do an operation. Here’s a sequence of the Gray numbers from 115 to 122 with the procedure arrows like above; the bold digit is the one that was changed in the previous step:

01001010 → 01001110 → 01001111 → 01001101

01001100 → 01000100 → 01000101 → 01000111
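
The scan just described can be written down in a few lines. Here's a sketch operating on an array of bits, most significant first:

```javascript
// One left-to-right pass: remember the rightmost position where the
// currently wanted flip is possible (a green arrow), changing our mind
// each time we pass a 1, then flip that position.
function flipFlopIncrement(bits) {
  var wantUp = true;    // we start out wanting to flip a bit up
  var candidate = -1;   // rightmost flippable position seen so far
  for (var i = 0; i < bits.length; i++) {
    var canFlip = wantUp ? bits[i] == 0 : bits[i] == 1;
    if (canFlip) candidate = i;          // a green arrow
    if (bits[i] == 1) wantUp = !wantUp;  // a 1 makes us change our mind
  }
  if (candidate == -1) return null;      // already the largest code
  var result = bits.slice();
  result[candidate] ^= 1;
  return result;
}
```

Running it on the example above, flipFlopIncrement([0,1,0,0,1,0,1,0]) flips the third bit from the right and yields [0,1,0,0,1,1,1,0].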

Besides being a neat way to increment a Gray code the flip-flop algorithm has a few interesting quirks that are worth taking a closer look at. One thing you may have noticed is that each arrow in the diagrams has two binary properties: a direction, up or down, and a color, red or green. If you read these properties as bits they form two different numbers: the direction value where up is 0 and down is 1, and the color value where green is 0 and red is 1. Here is an example:

01001010       Gray code
01110011  115  Color value
00111001  57   Direction value

If you think the color value, 115, looks familiar that's because it is: it's the binary representation of the Gray code. I found that pretty surprising when I noticed it. If you look at the color and direction values next to each other it's also clear that the direction is actually the color right-shifted once. Here are the color and direction values spelled out for the sequence above:

01001010 → 01001110 → 01001111 → 01001101   Gray code
01110011   01110100   01110101   01110110   Color value
00111001   00111010   00111010   00111011   Direction value

01001100 → 01000100 → 01000101 → 01000111   Gray code
01110111   01111000   01111001   01111010   Color value
00111011   00111100   00111100   00111101   Direction value

The intuitive reason for this becomes clearer if you interpret the procedure slightly differently. In the original phrasing you want to either change a 0 to 1 or a 1 to 0, and you keep changing your mind about which you want to do. An alternative view is to say that you always want to make a 0 into a 1, but you change your mind about what is 0 and what is 1. After a 1 you now interpret a 0 in the input as a 1, and vice versa. That way, in the cases where you can make the change you want, that is where the arrow is green, the bit must have been what you considered 0. And in the cases where the arrow is red the bit must have been what you considered 1 at that point.

The view of flipping the interpretation of bits actually matches the way Gray code is constructed. To see this consider how you increment from 7 to 8 in Gray and standard binary. With standard binary you go from 0111 to 1000, flipping all four bits. In Gray you go from 0100 to 1100 which is in some sense equivalent except that when you flip the highest bit you don’t have to explicitly also flip the lower bits because they are changed implicitly by changing your interpretation of 0 and 1. Viewing the numbers and the algorithm this way makes it easier to understand intuitively that the bits where you can make the change you want must correspond to a 0 in the binary representation and the bits where you can’t must correspond to a 1. Under this view the direction value tracks your view: if a direction bit is 0 you view the bits normally, if it is 1 you view them in inverse.

But wait, there’s more. The simplest way to Gray encode a binary number n is using xor:

int gray(int n) {
  return n ^ (n >> 1);
}

As we saw before the color value above is n and the direction is n right-shifted once. So in fact if you xor together the color and direction values you get the Gray code you started with. And if you check with the examples above that is indeed the case. Intuitively, since the direction value determines how we view the bits, if there is a 0 in the direction value the corresponding bit in the Gray value is the same as in the binary value. If the direction value has a 1 you have to flip the bit in the Gray code to get the binary value. This is exactly how xor works, so xoring the direction value and the Gray value will give you the binary value. And, since xor is associative, xoring the direction and the binary value gives you the Gray value.
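
To see the relationship concretely, here is the encoder from above together with a decoder; fromGray xors together every right-shift of the code, which telescopes back to n:

```javascript
// Encode: the xor construction from above.
function toGray(n) { return n ^ (n >> 1); }

// Decode: xor together all right-shifts of g. Since g = n ^ (n >> 1),
// the shifted terms cancel pairwise and leave just n.
function fromGray(g) {
  var n = 0;
  for (; g != 0; g >>= 1) n ^= g;
  return n;
}

var g = toGray(115);          // 01001010 in binary
var color = fromGray(g);      // 115, the color value
var direction = color >> 1;   // 57, the direction value
// color ^ direction recovers the Gray code: (115 ^ 57) == g
```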

There is one point left to revisit which is the recipe given in the Wikipedia article for incrementing a Gray code. Here it is again:

[…] at step i find the bit position of the least significant ‘1’ in the binary representation of i – flip the bit at that position in the previous code to get the next code.

We can now tie this together with the flip-flop algorithm. Since the standard representation of i is the same as the color value, what this is saying is: find the least significant 1-bit in the color value, which is the least significant red arrow, and flip it. This seems wrong – surely you’re supposed to flip the least significant green arrow? Let’s try their algorithm.

Take a random number, say the Gray code for 4, 110. The binary representation of 4 is 100. The least significant (and only) 1 is in the most significant bit position. Flipping that in the Gray code gives you 010. This is actually the code for 3. So what seemed wrong intuitively, flipping the red arrow, was indeed wrong.

Instead what the flip-flop algorithm suggests is to flip the least significant green arrow, or 0-bit in the standard representation. Let's try that. The least significant 0 in 100 is at the least significant bit position. Flipping that in the Gray code gives 111, which is indeed the code for 5. So it works in the example, and it works in general. Someone should probably fix that Wikipedia article.
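
The corrected recipe is easy to check mechanically. Here is a small sketch; toGray is the xor encoder from before, and the binary value is passed in alongside the code just to keep the example short:

```javascript
function toGray(n) { return n ^ (n >> 1); }

// Increment Gray code g whose binary value is n: flip the bit at the
// position of the least significant 0 in n.
function incGrayViaBinary(g, n) {
  var pos = 0;
  while (n & (1 << pos)) pos++;  // find the least significant 0-bit
  return g ^ (1 << pos);
}
```

For example incGrayViaBinary(toGray(4), 4) gives toGray(5), and checking every n up to a few thousand confirms the fixed recipe holds in general.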

1: Technically there’s more than one encoding that can be called Gray code but people usually mean binary-reflected Gray code and that’s what I mean too.

2: I have no doubt this algorithm is well known so the fact that I'm naming it doesn't mean I'm claiming to have invented something new – it's just convenient to have a name for it.

Tracing Promises

Every so often I find myself having to write JavaScript programs that interact with the world through XMLHttpRequests. And every time the first thing I do before I actually get to send off requests is implement a small promise framework. Promises are just so much more powerful than callbacks that at this point I wouldn't even consider writing raw callback-based asynchronous code.

One thing about promises though, especially if your program is just moderately complex, is that they make it much more difficult to follow the flow of your code. Everything happens asynchronously in short callbacks and when an error occurs it can be near impossible to piece together what happened. This post is about how to solve that problem.

Promises in general

Before getting to the part about tracing them here are some examples of uses of promises, to set the stage for the kind of operations we'll later want to trace. These examples use idioms from my own promise library but these considerations apply to any of the libraries out there; the concrete patterns may just be different.

One place I relied heavily on promises was in the implementation of the sputnik test web runner (which has now been subsumed by test262). The function for firing off an XHR using Google's Closure library looked like this:

function makeRequest(path) {
  var result = new Promise();
  goog.net.XhrIo.send(path, function (event) {
    // Fulfill the promise with the response text once the XHR completes.
    // (The exact fulfillment call depends on the promise library.)
    result.fulfill(event.target.getResponseText());
  }, "GET", undefined, undefined, 0);
  return result;
}

// Fetch google.com and alert the result.
makeRequest("http://www.google.com").thenApply(alert);

What this function does is return a promise object that gets fulfilled when the XHR, which is sent off at the same time, completes and invokes its callback. This may look like a whole lot of nothing – basically making a callback into a promise – but the advantage comes at the call site. Say you have two mirrors of the same data and want to send a request to both and use the first response you get back. With promises that is trivial:

// Fetch from both mirrors, use the first response.
var response = Promise.select(makeRequest(mirrorOne),
                              makeRequest(mirrorTwo));

The select operator takes any number of promises and returns a new promise that resolves as soon as any of the child promises resolve. Another use for select is to set timeouts:

// Fetch path but time out after one second.
var response = Promise.select(makeRequest(path),
Promise.failAfter("Promise timed out", 1000));

// Alert successful response, log any errors in the console.

Here the response will resolve either to the result of the request, if it happens within one second, or else will fail with the string "Promise timed out" when the result of failAfter fails after one second.

To get the results of several concurrent promises you can use join. This code sends off two XHRs at once, waits until they’re both complete, and then alerts the concatenation of the results.

// Fetch pathOne and pathTwo in parallel.
var request = Promise.join(makeRequest(pathOne),
                           makeRequest(pathTwo));

// When done, concatenate the two responses.
var response = request.thenApply(function (dataOne, dataTwo) {
  return dataOne + dataTwo;
});

// Finally alert the result.
response.thenApply(alert);

The join operator takes a list of promises and returns a promise that resolves to the list of values of all the child promises once they've been resolved. The thenApply operator on a promise returns a new promise whose value will be the result of applying the function to the value of the original promise.

Note that these examples are a lot more verbose than you would usually make such code. I wrote them like that to make it clearer what’s going on. Some operations would be packed into convenience methods, like setting a timeout, and you would generally use method chaining. With that the last example, with the addition of a three second timeout, would look like this:

var response = Promise
    .join(makeRequest(pathOne), makeRequest(pathTwo))
    .thenApply(function (a, b) { return a + b; })
    .withTimeout(3000); // timeout convenience method; exact name assumed

Packs quite a punch doesn't it? This is what I mean when I say that promises are more powerful than callbacks; it's not that one is more expressive than the other, they're computationally equivalent, but in terms of conciseness and power of abstraction there's just no competition. And this is just a small part of what you can do with promises. A good place to see the power of this style of programming in action is Twitter's Finagle stack.

The Errors! The Errors!

All’s well with this model as long as all operations succeed. However, as soon as operations start failing this is a mess. Consider for instance the case where you have numerous concurrent operations running that have timeouts. Errors are propagated through promises nicely so if you’ve set up a failure handler you’ll see the error, but it will look something like this:

Operation timed out
at promise.js:108:18

And that’s in the best case where error objects have stack traces. Since each step in the chain of promises is detached and runs in a separate turn of the event loop all you’ll see is that some timeout somewhere fired. There’s no telling which operation took too long. As soon as your program is just moderately complex this becomes a major issue. It’s like with normal synchronous programs if there were no stack traces. Stack traces are there for a reason: it sucks to debug without them.

The stack trace analogy is useful in understanding what information you’d want to have, if you could, when a promise fails. All these patterns of asynchronous operations correspond, when you’re only interested in error reporting, to simple call/return patterns we’re used to. For instance, the join operation is just an asynchronous version of taking a list of functions, calling each one, and returning a list of their results:

function syncJoin(funs) {
  var result = [];
  for (var i = 0; i < funs.length; i++)
    result.push(funs[i]());
  return result;
}

In this case if calling any of the functions fails then syncJoin will fail and both calls will show up in the stack trace. The same thing holds for join on promises: if join fails because one of the child promises fails you want information about both. The same thing holds for select, for the same reason.

This all suggests that when a promise fails and propagates that failure through a number of other promises what you as a programmer need is information about each promise in that chain. Something like:

Operation timed out
at promise.js:108:18
--- triggering the failure of ---
at Function.withTimeout (promise.js:106:17)
at runDemo (demo.js:226:11)
at demo.html:7:168
at onload (demo.html:8:4)
--- triggering the failure of ---
at Function.select (promise.js:50:16)
at Function.withTimeout (promise.js:110:18)
at runDemo (demo.js:226:11)
at demo.html:7:168
at onload (demo.html:8:4)

This is a trace created using the promise trace library I wrote last week to trace down some issues with a chrome extension I was working on. A promise trace is made up of segments, each one a stack trace identifying a promise. The top segment is the error that caused the failure, in this case exactly the same timeout as before. The next one is the promise created by withTimeout that fails when the timeout is exceeded. The bottom segment is the select between the operation and the timeout, and going to demo.js:226 will tell you which operation it was.

The trace above is from Chrome; here's the result of the same failure but in Firefox:

Operation timed out
--- triggering the failure of ---
([object Object],100)@promise.js:106
onload([object Event])@demo.html:1
--- triggering the failure of ---
([object Object],[object Object])@promise.js:50
([object Object],100)@promise.js:110
onload([object Event])@demo.html:1

It’s a bit more cluttered since the firefox stack trace api doesn’t support stripping irrelevant top frames but both contain the relevant clues you need when debugging.

Both traces contain a fair amount of redundant information. For instance, the two bottom promises are created close to each other so the bottom of their stack traces are identical. To make the traces easier to read the promise trace library folds away overlapping stack traces. The actual trace that would be printed for the above error is

Operation timed out
at promise.js:108:18
--- triggering the failure of ---
at Function.withTimeout (promise.js:106:17)
at runDemo (demo.js:226:11)
at demo.html:7:168
at onload (demo.html:8:4)
--- triggering the failure of ---
at Function.select (promise.js:50:16)
at Function.withTimeout (promise.js:110:18)
at runDemo (demo.js:226:11)
(... rest of trace same as previous ...)

This is probably still more cluttered and redundant than it needs to be – in my experience for each segment you only need to know the top frame below the call into the promise library and the bottom frame that is different from the previous segment. But I'm not sure that's true in all cases so for now this is as small as I'd like to make it.

Getting promise traces

How does this API work? From a user's perspective all you see is that on failure your callback is given two values: the error and a trace. You'd typically use it something like this:

myOperation.onFailed(function (error, trace) {
  // Log the failure along with its promise trace.
  console.log(error + "\n" + trace);
});

This requires some support from the promise library but not a lot actually. Whenever a promise is created it has to create a PromiseTraceSegment value and when propagating a failure it should create and propagate a chain of PromiseTrace objects which collect the relevant segments. In my own promise library the tracing code makes up maybe ten lines.
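
What those few lines have to do can be sketched like this. The two type names come from the post; everything else is an assumption about how such a library might be put together, not the actual implementation:

```javascript
// Sketch: capture a stack trace at each promise creation site and
// chain segments together as a failure propagates.
var captureTraces = true;  // the flag to disable capturing altogether

function PromiseTraceSegment() {
  // new Error().stack records where the promise was created.
  this.stack = captureTraces ? new Error().stack : null;
}

function PromiseTrace(segment, parent) {
  this.segment = segment;  // where this promise was created
  this.parent = parent;    // trace of the promise whose failure triggered this one
}

// Formatting: print each segment separated by the marker lines seen above.
PromiseTrace.prototype.toString = function () {
  var parts = [];
  for (var t = this; t != null; t = t.parent)
    parts.push(t.segment.stack);
  return parts.join("\n--- triggering the failure of ---\n");
};
```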


The way this is implemented, as is probably evident from the examples above, is by capturing a stack trace every time a promise is created (there's also a flag to disable trace capturing altogether). The whole thing is actually pretty straightforward: the whole trace library is ~200 lines, of which most are for formatting the output. Capturing stack traces is not super expensive generally but even relatively cheap operations add up if you do them often enough. To get a sense for how expensive tracing is I wrote a simple, ludicrously asynchronous implementation of the Fibonacci function:

function lazyFib(n) {
  if (n == 0 || n == 1) {
    return Promise.of(1);
  } else {
    return Promise.deferLazy(function () {
      return Promise
          .join(lazyFib(n - 2), lazyFib(n - 1))
          .thenApply(function (a, b) { return a + b; });
    });
  }
}

I then tried running it with and without promise tracing. In Chrome tracing makes it around 60% slower on average. In Firefox the cost is only around 20% but then this program was a lot slower overall; I believe this is because in Firefox the event queue runs at a slower rate by design. In Safari you can enable tracing and it's essentially free, but in the version of Safari I tried you can't actually get a stack trace so the promise traces produced are useless. I believe stack trace support is coming to Safari too though.

This example is the absolute worst case performance cost; nobody would write as promise-heavy a program as that example. To also get a sense for the impact on a more realistic program I wrote a larger benchmark, one that simulates an RPC system with a string repository that holds a large number of strings which have to be fetched asynchronously, a few at a time, and then processed by a chain of other asynchronous operations. It was still ridiculously asynchronous, much more than any realistic program, but at least not quite as trivial. This time the performance cost in Chrome was 0.8% on average. Statistically significant but tiny.


Debugging asynchronous JavaScript programs is difficult. However, adding support to libraries that implement asynchronous abstractions like promises is straightforward and the overhead of collecting traces is likely to be insignificant for any realistic JavaScript program.


The first large JavaScript program I wrote after ES5 came out was my mercury Chrome extension. I was surprised at how big a difference some relatively small changes to the language make to my code style. Before then I had followed the process that led to ES5 and knew about those features long before I tried them, but it was only when I actually used them that I got a sense and feel for what they were like in practice. I thought, wouldn't it be useful if you could get that while developing the language rather than after?

So a few months ago I sat down to write a language experimentation framework for JavaScript, MyJS. What MyJS does is allow you to define a language extension, including new syntax and/or library functions. A source file (HTML or JS) can then specify which extensions it uses, and the parser and translation framework will plug extensions together as appropriate to parse your program and convert it to plain JavaScript. How it works is worth a blog post of its own, but here I'll outline how you define a language extension and use it.


You can try MyJS by going to this test page (I haven't tested this in all browsers, it probably doesn't work in all of them). The program that prints the output you see on that page uses the delegates extension, the operator ::, which returns a Python-style bound method. This is what the program looks like:
// A Printer can print text to a document.
function Printer(prefix) {
  this.prefix = prefix;
}

// Add this printer's prefix and print the given value.
Printer.prototype.print = function(value) {
  var elm = document.createElement('div');
  elm.innerText = this.prefix + value;
  document.body.appendChild(elm); // attach so the text shows up
};

var p = new Printer("Looky here: ");

[1, 2, 3].map(p::print);
(::print)(p, "Where?");
(The idea for the delegates operator is Erik Corry's.) In this case the binary ("bound") :: operator returns a function that, when called, calls p.print with the arguments it's given. So (p::print)("foo") means exactly the same as p.print("foo"). The unary :: ("unbound") operator returns a function that uses its first argument as the receiver and the rest as arguments. So (::print)(p, "bar") is the same as p.print("bar"). This is just a random example of an extension; the operator itself isn't that important.
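
Stripped of the new syntax, the two operators are equivalent to these plain JavaScript helpers (hypothetical names, just to pin down the semantics; the extension generates inline code instead):

```javascript
// p::print  behaves like  bound(p, 'print')
function bound(obj, name) {
  return obj[name].bind(obj);
}

// ::print  behaves like  unbound('print')
function unbound(name) {
  return function (recv) {
    // Use the first argument as the receiver, pass the rest along.
    return recv[name].apply(recv, Array.prototype.slice.call(arguments, 1));
  };
}
```

With the Printer above, bound(p, 'print')("foo") does the same thing as p.print("foo"), and unbound('print')(p, "bar") the same as p.print("bar").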

Using a dialect

The way you use an extension is by using a MyJS specific script tag:

This code first includes the MyJS JavaScript library. It then includes a script called delegates.js, which defines the delegates extension. MyJS comes with a few language extensions built in which make it easier to write new language extensions; delegates uses one of those, myjs.Quote. Finally, once we've loaded the delegates extension the third script element can use the extended language. Easy!

Defining an extension

How do you define a language extension? It has four parts: defining syntax trees, describing the concrete syntax, implementing the rules for translating the new syntax trees into plain JavaScript, and finally hooking it all into the framework. The full definition of the delegates extension is here. For the delegates extension we need two new syntax trees, one for bound and one for unbound delegates:

// Bound expression syntax tree node.
function BoundMethodExpression(atom, name) {
  this.type = 'BoundMethodExpression';
  this.atom = atom;
  this.name = name;
}

// Unbound method syntax tree node.
function UnboundMethodExpression(name) {
  this.type = 'UnboundMethodExpression';
  this.name = name;
}

The syntax tree format is modelled on Mozilla's Parser API and the parser for the standard language constructs produces Parser API-compatible syntax trees.

To produce these we need to hook into the grammar. The syntax is pretty straightforward:

LeftHandSideSuffix ::= "::" Identifier

PrimaryExpression ::= "::" Identifier

(A left hand side suffix is basically the right hand side of an operator expression). The way to express this in MyJS is to map it into calls to the grammar constructor library:

function getSyntax() {
  // Suffix syntax tree builder helper.
  function BoundMethodSuffix(name) {
    this.name = name;
  }

  // Apply this suffix to a base expression.
  BoundMethodSuffix.prototype.apply = function(atom) {
    return new BoundMethodExpression(atom, this.name);
  };

  // Build the syntax
  var f = myjs.factory;
  var syntax = myjs.Syntax.create();

  // LeftHandSideSuffix ::= "::" Identifier
  // (exact rule-registration method names assumed)
  syntax.getRule('LeftHandSideSuffix')
      .addProd(f.punct('::'), f.nonterm('Identifier'))
      .setConstructor(BoundMethodSuffix);

  // PrimaryExpression ::= "::" Identifier
  syntax.getRule('PrimaryExpression')
      .addProd(f.punct('::'), f.nonterm('Identifier'))
      .setConstructor(UnboundMethodExpression);

  return syntax;
}

There’s a bit of a quirk here because a left hand suffix doesn’t correspond to a “full” syntax tree, it has the left hand side missing. So instead it returns a suffix object that will be called later by the parser with the left hand side as an argument and must return the full syntax tree.

The reason we don’t just build the syntax but define a function for doing it is that we only want to build it if the dialect is used. If it is just defined but never used it’s a waste of time to build the syntax.

This allows the framework to parse extended code and build syntax trees. The next part is to define how they are translated into plain JavaScript:

BoundMethodExpression.prototype.translate = function(context) {
  return #Expression((function(temp) {
    return temp.,(context.translate(this.name)).bind(temp);
  })(,(context.translate(this.atom))));
};

UnboundMethodExpression.prototype.translate = function(context) {
  return #Expression(function(recv, var_args) {
    return recv.,(context.translate(this.name)).apply(recv,
        Array.prototype.splice.call(arguments, 1));
  });
};

This is where it gets a little bit tricky. This code uses a language extension, myjs.Quote, to plug together syntax trees. While this is tricky the first time you see it, it is a lot better than having to build syntax trees by hand without quoting. The #Expression part means: the following code should be parsed as an expression but don't run the code, return the syntax tree. The commas mean: this must be evaluated and the result, which is a syntax tree, must be spliced into the surrounding tree. If you know quasiquote from Scheme, that's basically what this is. What this code does is say: if you have the syntax tree A::B translate it into something like


(A.B).bind(A)

except that will cause A to be executed twice so we use

(function (t) { return (t.B).bind(t); })(A)

and similarly for ::A. The recursive calls to translate are there to translate the subexpressions.

The last part is to hook it all into the framework. That’s done using this incantation:

myjs.registerFragment(new myjs.Fragment('demo.Delegates')
    .setSyntaxProvider(getSyntax) // registers getSyntax; exact method name assumed
    .registerType('BoundMethodExpression', BoundMethodExpression)
    .registerType('UnboundMethodExpression', UnboundMethodExpression));

Here we give the extension a name, demo.Delegates, register the function that will return the syntax to use, and register the two new types of syntax trees. That’s what it takes to define an extension.

Fragments also allow you to set up install hooks, so if your extension needs to add library functions to the global object for instance you can specify a function that is given the global object as an argument and can install any functions and methods it needs.

…six months later

I sort of ran out of steam on this project six months ago so while the "hard" part is there – the extensible parser framework, the basic language definition (all the standard language constructs are defined as if they were extensions), the translation framework, etc. – there is still some work left to do before everything works; for instance, before you could define the Harmony classes syntax. It would also be great if you could specify a dialect the same way you specify strict mode, within the script ("using demo.Delegates").

I also played around with making this work with Node.js but I couldn't find the hooks in the module importing primitives that would allow me to intercept module loads and do the source translation.

The code lives on github.