## Analytical Programming

This is my third post about Babbage’s calculating engines. The first two were about the difference engine: why it was important at the time and how it worked. This post is about the analytical engine. The analytical engine is famously the first programmable computing machine, and there was much programming involved both in designing and operating it. In this post I’ll take a look at the various ways you could program the engine and the way programs were used to control the engine internally.

## Programming the analytical engine

When the analytical engine is mentioned as the first programmable computer the example you almost always see is one particular program, the one for calculating Bernoulli numbers that was given in note G of Ada Lovelace’s notes on the engine. But that’s only the tip of the iceberg. This post is about the rest of the iceberg. (Don’t worry, I’ll give the Bernoulli program a whole post of its own). The engine could be programmed on roughly three different levels and we’ll take a look at each of them in some detail, but first I’ll give an brief overview of each of them.

At the highest level programs were written as tables like the one here on the right. That one is taken from Menabrea’s 1842 article about the engine. Each row in the table is an instruction, a mathematical operation to be performed. In modern terms we would call this a register language since all the operation’s inputs and outputs are given explicitly. This is the format that was used in all contemporary articles about the engine and the format of the Bernoulli program. However, a program in table form could obviously not be run by the analytical engine directly, it was more like what we would call pseudocode today. It describes the program you want to execute but it’s not executable itself.

The way you made executable programs was using punched cards. To run a program written in the table format you would have to translate it into a stack of cards that could be interpreted by the machine. You might think of the cards as a type of bytecode. Babbage seems to have considered this mostly an implementation detail so it’s not super well described, but we still know enough to get a pretty good sense for how card-based programming would have worked.

At the bottom layer there was an internal “microcode” format that controlled how the engine executed each of the punched-card encoded instructions. The microcode programs were encoded as rows of pegs on the side of rotating barrels, like the pins on a music box. The pins controlled operations and data flow within the engine and the control flow of the microprograms themselves. Some of the more complex instructions such as multiplication and division had very elaborate implementations of more than a hundred verticals, Babbage’s name for a single micro-instruction.

In the rest of this post I’ll describe two of these formats, tables and microcode. The punched card format has a separate post which is linked at the end. First though, a quick note on sources. My source for most of this post is some excellent articles by Allan Bromley: The Evolution of Babbage’s Calculating Engines from 1987 and Babbage’s Analytical Engine Plans 28 and 28a – The Programmer’s Interface from 2000. If you want more information these are the articles to read. (Obscenely they are both behind IEEE’s paywall which I suspect is one reason they’re not as widely read as they deserve to be.)

With that let’s get on to the first language level: tables.

## Tables

The basic model of the analytical engine is similar to the difference engine but generalized along several dimensions. The difference engine had 8 columns, what we would call registers, with 31 decimal digits of precision (roughly 103 bits). These could be added together in a fixed pattern, right to left. The analytical engine had a much larger number of columns, Babbage considered 1000 to be realistic, and it could add, subtract, multiply, and divide them in any pattern. The columns also had more precision, 50 decimal digits (roughly 166 bits). Each column had an index, i; the i‘th column is written as Vi. The V stands for variable which I’ll use interchangeably with the word column.

The table format for programming the engine, the most high-level format, represents a sequence of instructions as rows in a table. Each row specifies an operation along with the input and output columns. For instance, to calculate (V1 + V2 + V3) and store the result in V4 you would do something like:

# op in out
1 + V1 + V2 V4
2 + V3 + V4 V4

The first instruction adds V1 and V2, storing the result in V4, and the second adds V3 to V4. It’s pretty straightforward really – but in this simple example I’ve cheated and omitted a few details. We’ll be adding those back now.

In modern programming languages we’re used to being able to read a variable as many times as we want with no side-effects. With the analytical engine on the other hand when you read a column what you’re actually doing is transferring the value mechanically from the column to the processing unit, the mill, which causes the column to be cleared. It’s obviously inconvenient if you can’t read a column more than once. To solve this the engine supported two kinds of reads: the simple read where the column is cleared in the process and a the restoring read where the column retains its value. A restoring read works by simultaneously transferring the value to the mill and a temporary storage column and then immediately transferring the value back from temporary storage to the column.

To indicate which kind of read to do there’s an extra field in the table indicating column value changes. If we go back to the program from before, let’s say that we don’t mind that V2 and V3 are cleared but we need to retain V1. We would express that as

# op in out vars
1 + V1 + V2 V4 V1 = V1
V2 = 0
2 + V3 + V4 V4 V3 = 0

In the first operation we want to restore V1 after it’s been read but let V2 get cleared. In the second instruction we let V3 get cleared and we don’t need to specify what happens when we read V4 because that’s where the result is stored.

This program contains the full information you would need to be able to run it on the engine. The original tables annotate programs some more but anything beyond this is more like comments, it doesn’t change the behavior of the program.

One additional piece of information you’ll see in most of the original programs, the one on the right here is another of Menabrea’s, is that all the column names have a second index in superscript. So where I’ve written V1 one of the original tables would have something like 1V1. The second index indicates how many times the variable has changed. So 1V1 means “the first value stored in V1“, 2V1 means “the second value stored in V1, after it had been overwritten once”. This doesn’t mean that you can recover old values of a variable, it’s just for the programmer to keep track of what the current value is of each variable. You can also write 0V1 which means the original 0 stored in V1 in the case where we haven’t written to that column at all yet. If we add in these indices the program will look like this:

# op in out vars
1 + 1V1 + 1V2 1V4 1V1 = 1V1
1V2 = 0V2
2 + 1V3 + 1V4 2V4 1V3 = 0V3

(The 0V2 assignment is just a convention, it means the same as resetting V2 to its initial state where it contained 0).

This is the language used to write the first computer programs. Even though it’s unusual it will look familiar to any modern programmer familiar with assembly programming. There is no way to specify control flow even though there is some indication in the Bernoulli program that it had been considered. These are basically straight-line sequences of mathematical operations on a set of variables. And being pseudocode the conventions weren’t fixed, they were changed and adapted by the authors to fit the programs they were writing.

The microprogram format is at the other end of the spectrum; where the tables were high-level and meant for communicating programs clearly the microprograms where low-level and written and understood only by Babbage, at least until recently.

## Microprograms

The analytical engine could perform a number of complex operations including multiplication and division. To give a sense for how complex I’ll give an example of how the engine would multiply two numbers.

Say we instruct the engine to multiply 17932 with 2379. The multiplication logic would first determine which of the operands had the fewest digits and put that in one register. (The computing mill had a number of internal storage columns that were used to hold partial results during individual operations. I’ll call those registers to distinguish them from the user-accessible columns). The other number, the one with most digits would be used to generate a table of all the multiples of that number from 2 to 9, using addition. In this case that’s 17932:

factor value
1 17932
2 35864
3 53796
4 71728
5 89660
6 107592
7 125524
8 143456
9 161388

Once this table had been built the engine would scan through the other number, in this case 2379. For each digit it would look up the corresponding multiple in the table, shift it left by an amount corresponding to the significance of the digit (that’s base 10 shift), and add the resulting values as it went:

digit product
9
161388
70
1255240
300
5379600
2000
35864000

Adding those four values together you get 42660228, the product of 17932 and 2379, calculated using the primitive operations of addition and multiplication by 10. The whole operation took time proportional to the number of digits of the shortest of the input numbers. Unlike the difference engine which stored numbers as tens complement the analytical engine stored the sign of the number as a separate bit. That way the multiplication could be unsigned and the sign could be computed separately. This meant that the engine had two zeroes, plus and minus.

Back to multiplication. The engine needed an internal control mechanism to take it through this complex process and what it used was a microprogram encoded by pegs on a rotating barrel. You can see a simplified diagram of the barrel here on the right. Each “cycle” of the microprogram proceeds in the same way.

First the barrel is pushed left and the pegs along the vertical edge press against a row of levers which causes them to engage or disengage other parts of the engine. Because of the way the pegs are aligned with the vertical edge of the barrel Babbage called a single such instruction a vertical.

Second, the barrel is pushed to the right and connects to the big gear you see to the right of it in the diagram. That gear, and the gears connected with it, Babbage called the reducing apparatus. That’s what controls the flow of the program. The reducing apparatus rotates the barrel some amount in either direction to select the next vertical to apply. At the same time any other components that were engaged by the current vertical perform their operation, for instance a single step of building the multiplication table. The reducing apparatus takes input from those other components so for instance it may move the barrel different amounts depending on whether the last addition overflowed. That’s the arm on the far right (Babbage called overflow “running up”). The reducing apparatus is controlled by the barrel itself so each vertical explicitly specifies how the reducing apparatus should rotate it to the next vertical. You’ll notice that the three gears you see near the reducing apparatus’ main gear have 1, 2, and 4 teeth respectively. By engaging a combination of them one vertical could have the reducing apparatus rotate the barrel any number of steps between 1 to 7. In modern terms, each micro-instruction contains an explicit relative branch, conditional or unconditional, to the next microinstruction. As you can see this is a highly sophisticated and general mechanism. The only disadvantage is that it’s slow – a single cycle takes a few seconds so a multiplication can take several minutes.

As you would expect the diagram above is simplified, in practice there were multiple barrels and they were much larger both in the number of pegs for each vertical and number of verticals per drum. I haven’t been able to find any of Babbage’s actual microprograms unfortunately so for now all I know are the basic principles, and we know that designing them was one of Babbage’s main interests in designing the engine.

The third program format is the punched cards which is what would have been used by an operator of the engine. I’ll look at those in some detail because they give a good sense of what it would have been like to work with the engine in practice. To keep this post to a reasonable length I’ve broken that out into its own post called punched cards.

## Difference

This is my second post in a series about Babbage’s mechanical calculating engines. The first post was about why it was so important in the early 1800s to be able to produce accurate arithmetic tables and ended with what I’ll be generous and call a cliffhanger: that Babbage’s difference engine was a groundbreaking solution to that problem. In this post I’ll explain how the difference engine works, how to “code” it and how to interpret what you get back.

Before we get to the engine itself there’s a bit of math to explain the basic principles it works by. To give you a taste of the engine itself before that here is a small JavaScript emulator1 that performs a simple calculation. It’s tiny compared to the real engine but works by the same principles just on a smaller scale.

Try stepping it a few times. The model is actually quite straightforward and the computations you can use it for are based on some basic mathematical principles. You can treat it as an exercise before reading the rest if you like: try to work out what it’s calculating, how it does it, and how you can make it calculate other things. Of course you can also just read on an I’ll tell you.

## Differences

The difference engine is a very simple device in principle. It’s an adding machine, that’s all it is2. Part of what makes it such an impressive accomplishment is that it took a difficult problem, calculating complex functions, and solved it using a simple enough approach that it could be implemented mechanically using contemporary technology3.

So, how do you calculate complex functions using just addition? Obviously you can’t in general so let’s start off easy by looking just at polynomials.

Consider this simple polynomial, the square function:

$f(x) = x^2$

The first few values are

$\begin{tabular}{lcl} f(0) & = & 0 \\ f(1) & = & 1 \\ f(2) & = & 4 \\ f(3) & = & 9 \\ & \vdots & \end{tabular}$

The difference engine is based on a technique called divided differences. Divided differences is similar to differentiation but simpler and it’s based on simple arithmetic. It works as follows. You take the values of your polynomial at a fixed interval, here we’ll use the first four values from before:

$\begin{tabular}{l} 0 \\ \\ 1 \\ \\ 4 \\ \\ 9 \end{tabular}$

Then you find the distance between each successive pair of them:

$\begin{tabular}{l|l} 0 \\ & 1 \\ 1 \\ & 3 \\ 4 \\ & 5 \\ 9 \end{tabular}$

These are the first differences. Then you find the distance between the first differences the same way:

$\begin{tabular}{l|l|l} 0 \\ & 1 \\ 1 && 2\\ & 3 \\ 4 && 2\\ & 5 \\ 9 \end{tabular}$

These are the second differences. And one last time to get the third differences:

$\begin{tabular}{l|l|l|l} 0 \\ & 1 \\ 1 && 2\\ & 3 && 0\\ 4 && 2\\ & 5 \\ 9 \end{tabular}$

You can see the similarity to differentiation: it’s a polynomial of degree 2 so the first differences increase linearly just like the first derivative, the second differences are constant just like the second derivative, and so on. We don’t actually need the third differences, they’ll all be 0 anyway, so I’ll leave those out below.

What’s neat is that once you have these values you can extend the table using nothing but addition. You know the difference between the first derivatives is fixed, 2, so you can get the next first derivative by adding 2 to the previous one. And you know the difference between the function values is the first differences so you can get the next value just by adding the next first difference to the previous function value. Okay maybe it’s easier to explain with a concrete example:

$\begin{tabular}{l|l|l} 0 \\ & 1 \\ 1 && 2\\ & 3 \\ 4 && 2 \\ & 5 \\ 9 \end{tabular} \quad\to\quad \begin{tabular}{l|l|l} 0 \\ & 1 \\ 1 && 2\\ & 3 \\ 4 && 2 \\ & 5 & \tiny{+0} \\ 9 && \bf{2} \end{tabular} \quad\to\quad \begin{tabular}{l|l|l} 0 \\ & 1 \\ 1 && 2\\ & 3 \\ 4 && 2 \\ & 5 \\ 9 & \tiny{+2} & \bf{2} \\ & \bf{7} \\ \end{tabular} \quad\to\quad \begin{tabular}{l|l|l} 0 \\ & 1 \\ 1 && 2\\ & 3 \\ 4 && 2 \\ & 5 \\ 9 && \bf{2} \\ \tiny{+7} & \bf{7} \\ \bf{16} \\ \end{tabular}$

Notice that we don’t need the full table at this point, we only need that for calculating the initial values. All we need to generate more values is the last of each of the differences:

$\begin{tabular}{l|l|l} \multicolumn{1}{r}{}&&2 \\ & 7 & \tiny{+0}\\ 16 &\tiny{+2}& 2 \\ \tiny{+9} & 9 \\ \bf{25} \\ \end{tabular} \quad\to\quad \begin{tabular}{l|l|l} \multicolumn{1}{r}{}&&2 \\ & 9 & \tiny{+0}\\ 25 &\tiny{+2}& 2 \\ \tiny{+11} & 11 \\ \bf{36} \\ \end{tabular} \quad\to\quad \begin{tabular}{l|l|l} \multicolumn{1}{r}{}&&2 \\ & 11 & \tiny{+0}\\ 36 &\tiny{+2}& 2 \\ \tiny{+13} & 13 \\ \bf{49} \\ \end{tabular} \quad\to\quad\dots$

This provably works for any polynomial. To generate a sequence of values for a polynomial of degree n all you need is n+1 initial values; from the values you calculate the table of differences using subtraction, and from there on you can calculate as many successive values as you like using just addition. You don’t even need to know the closed form of the polynomial as long as you can evaluate the initial values at fixed intervals.

This is the basic principle of the difference engine, and what it’s named after. The engine has 8 integer registers called columns that can each hold a 31-digit integer value which represent the current value of the function and the first to the seventh difference. By cranking the handle those values are added together from right to left. Here is a another mini-emulator, this one calculating the square function using the differences we just calculated:

You can see the values are being added together from right to left and the current function value in the leftmost column is printed for each emulated crank on the handle. Printing was also a part of the original difference engine. A common source of errors in mathematical tables was typesetting so to avoid that step the engine would automatically stamp its output in a soft material that could be used directly for printing, as well as print a log on paper.

Being able to evaluate an integer polynomial is just the beginning though. First of all, integers aren’t enough, we need to be able to evaluate real-valued functions. Secondly, so far we’ve only seen positive values, we also need negatives. Finally, polynomials can be useful in themselves but we’re really more interested in more complex functions like logarithms and trigonometric or astronomical functions. But with a few tricks the difference engine can handle all those things.

## Precision

First off: how do we use this to evaluate real-valued functions? You use fixed-point decimal numbers. For instance, say we want to plot the square function from before but this time in steps of 0.25:

$\begin{tabular}{lcl|l|l} f(0) & = & 0 \\ &&& 0.0625 \\ f(0.25) & = & 0.0625 && 0.125 \\ &&& 0.1875 \\ f(0.5) & = & 0.25 \end{tabular}$

These are fractional numbers but if you multiply them by 105 we’re back to integers

$\begin{tabular}{lcl|l|l} 10000 f(0) & = & 0 \\ &&& 625 \\ 10000 f(0.25) & = & 625 && 1250 \\ &&& 1875 \\ 10000 f(0.5) & = & 2500 \\ \end{tabular}$

Now we’re back to something the engine can calculate:

I’ve added a line between the digits to mark where the decimal point goes. It also gets added to the output (I don’t believe the original engine did this). But the decimal point is purely a matter of interpretation by the operator, the adding mechanism is not aware of it, it’s operating purely on 6-digit integer values.

In this case we were lucky because there was a relatively small factor of 10 we could multiply onto the values to get integers without losing precision. That’s unlikely to be the case in general. If you can’t use this trick you multiply the values with as large a factor of 10 as you have precision for and just bite the bullet, round the scaled values to the nearest integers and lose some accuracy. That’s not necessarily as bad as it sounds. The original design had 31 digit precision in base 10 which corresponds to roughly 103 bits, well beyond the already quite large 64-bit integers on most modern machines. So you can afford to scale the values up quite a bit before rounding. We’ll see an example in a bit of how long it takes for errors to accumulate.

## Negative values

To represent negative values we use the exact same trick as with binary numbers, just in base 10: tens complement. A negative number d is represented as 10n – d where n is the number of decimal digits, so -1 is represented by 999…999 and so forth. The adding mechanism itself has no concept of negative values, just like with twos complement the correct behavior just falls out of how overflow works. It’s up to the operator or printer to interpret the output correctly as signed or unsigned values.

Here is an example of a function that starts off positive, peaks, and then descends into the negative numbers.

You’ll notice that the numbers in the columns are all positive but the ones that represent negative values are printed as negative. As with the decimal point that’s a hack I added in the printer which makes it print smaller integer values as positive and larger ones as negative. But it’s purely a matter of interpretation, the calculating part of the engine is oblivious.

## Polynomial approximation

Okay, now we’re getting close to the goal: being able to produce accurate arithmetic tables of general functions.

In my previous post the main example of how important mathematical tables were was how astronomical tables were used by mariners to navigate by lunar distance. I don’t actually know the math underlying those astronomical tables so here I’ll use an example I do understand: trigonometric functions. On the right here is a table of trigonometric functions from 1785. It gives 7 digits of the values of 8 different trignonmetric functions, sin, cos, tan, etc., for each arcminute between 0° and 45°. There’s 60 arcminutes to one degree so that’s 2700 values for each function, 21,600 values in total. The author of this table, Charles Hutton, said about this edition in a later one that it had so many errors that is was basically useless:

Finding, as well from the report of others, as from my own experience, that those editions […] were so very incorrectly printed, the errors being multiplied beyond all tolerable bounds, and no dependence to be placed on them for any thing of real practice […]

For this last part about how to calculate complex functions I’ll show how to replicate one column, sin, of this table.

Since we know how to evaluate polynomials the solution that first springs to mind is to approximate the function we want to calculate using a polynomial. Taylor polynomials were well known at this time so that’s an obvious approach. Taylor’s theorem says that for an infinitely differentiable function f (which sin is),

$f(x) = \sum_{k=0}^{\infty}\frac{f^{(k)}(a)}{k!}(x-a)^k$

where f(k) means f differentiated n times and a is any point on the function. Since the engine only has 8 columns and we need n+1 columns for a polynomial of degree n we have to limit ourselves to at most the first 7 terms. And in fact, since I want to demonstrate this with the emulator in a little bit and 8 columns of 31 digits takes up an enormous amount of space we’ll limit ourselves even more in this example to 4 columns of 13 digits. This means that we’ll use only the first 3 terms of the Taylor polynomial. For sin those are:

$\sin(x) \approx x - \frac{x^3}{3!}$

(Conveniently all the even degree terms are 0). This approximates sin quite well around 0 so we’ll use that as the basis for generating the table.

To calculate the differences we first need to produce n+1 values at fixed intervals. The generated table should have an entry per arcminute so we’ll start at 0′ and do steps of 1′:

x sin(x)
0′ 0
1′ 2.909 10-4
2′ 5.818 10-4
3′ 8.727 10-4

Then we need to find the nth differences:

x sin(x) Δ1 Δ2 Δ3
0′ 0 2.909 10-4 -2.461 10-11 -2.461 10-11
1′ 2.909 10-4 2.909 10-4 -4.923 10-11
2′ 5.818 10-4 2.909 10-4
3′ 8.727 10-4

All of this is a matter of evaluating a polynomial so that’s not super hard to do by hand with as many decimals as you need, as long as you only need a few of them. From this table we take the last of each of the differences and that’ll be the starting point for the calculation:

0.000872664515235
0.000290888130722
-0.000000000049228
-0.000000000024614

At this point we need to decide how much accuracy we want, that is, where we want the fixed decimal point to be. We have 13 digits which gives us room enough to multiply by 1013 before rounding. That gives us these integer values:

8726645152
2908881307
-492
-246

And now we’re ready to get tabulating:

If you follow along with the original table you can see that it generates exactly the same values. The values generated by the engine continue to match the table values until 1° 1′, 57 values in, where the table says 0.0177432 and the engine produces 0.0177433, and after that it continues to produce matching values up until 1° 53′, more than 100 values in.

Not bad right? And remember, this is a simplified emulator that can only calculate the third degree approximation where the original could go up to the seventh, and only with 13 digits of precision where the original had 31.

So what’s the source of the deviation? There’s two: the approximating polynomial and the accumulating error of the engine. Let’s first look at the polynomials.

The first plot on the right is of how quickly the approximating polynomials of different degrees deviate from the true sine function. At 1° the approximations are still well within 0.01 billionth of the correct value. Like I said, near 0 these polynomials approximate sin really well.

This suggests that the main source of inaccuracy is the engine itself, the precision we had to discard when fixing the decimal point, and as you can see in the second graph, it is. The engine loses precision faster than the polynomial by a large factor. This makes sense because the inaccuracy accumulates for every step.

Luckily in this case the polynomial deviates slowly enough we can use it to calculate new almost-accurate initial values at fixed intervals, for instance for each degree, and reset the machine to those. However, eventually the polynomial itself will deviate too much and at that point we can use the fact that the Taylor polynomial has an a parameter that specifies the point around which we’re approximating. So say the polynomial that approximates around 0° becomes too inaccurate at 6° we can derive a Taylor polynomial around 6° and use that to continue the calculation. Indeed, since the polynomial approximates equally well on both sides of the point we might as well approximate around 9° and use it for all values between 6° and 12°.

Sin is a relatively easy function to approximate in this way, a function such as log is harder but the same basic principles apply to harder functions. It’s a matter of how often you need to reset to get rid of the accumulating errors and how long the same approximating polynomial remains accurate.

One of the weak points of the engine is that even though it requires less manual work than producing a table completely manually, there’s still a fair amount of manual analysis and computation to be done. That’s not a killer in itself though. Even if it took just as much work to operate, which it surely wouldn’t have, just having a different way to create these tables would have been immensely valuable since two different approaches are likely to produce different errors and hence can be used to cross-check each other. But as this illustrates, even if the engine had been built it was definitely not a matter of just plugging in a few values and then going to lunch, using it involved a fair amount of work4.

## The revolution that didn’t happen

Here’s a video of the only difference engine ever built which was completed in 2002.

Babbage tried to build it and ultimately gave up for a number of reasons including family problems, financial problems, problems working with his engineer, and his gradual change of focus to the analytical engine. Despite what you often hear the technology did exist to build it; the modern one was built from his designs and only with technology that would have been available back then.

It also appears that Babbage’s interests and those of the English government who paid for the whole thing were just too far apart. Babbage was interested in the machine itself whereas his sponsors just wanted accurate tables, whatever way Babbage could produce them. It’s a shame really. It seems from what I’ve read that the difference engine was a failure not of vision or technology but of product management. The technology was so promising that if a successful prototype had been built and he’d delivered the tables the English government wanted it’s not unlikely that they would have continued to fund research in more advanced engines. The ground would have been fertile for mechanical calculation on a larger scale by the mid 1800s. Obviously that wouldn’t have meant a mechanical iPad in every home by 1900 but it would certainly have been a better outcome than what happened, that the designs went in the drawer.

Ultimately Babbage moved on to the analytical engine, the first ever programmable computer. My next post will be about that and in particular the first ever software programs which were written for it.

## Sources

For more information about the difference engine a quick introduction is given in the first part of Menabrea’s Sketch of the Analytical Engine from 1842. A more detailed description, including of the underlying mechanics of the engine, can be found in Lardner’s Babbage’s Calculating Engine from the July 1834 edition of the Edinburgh Review.

## Notes

1: While the emulator performs the same type of calculations as the difference engine it actually looks nothing like it. I made a point to give the emulator a mechanical feel but it’s inspired more by the Curta calculator, the mechanical calculator I know best, not the difference engine. Note also that I have only tested it on the browsers I have easy access to, Chrome, Firefox, Opera, and Safari on mac. If it doesn’t work on your platform the code lives on github and patches are welcome.

2: Ada Lovelace in her famous notes about it is almost contemptuous of how simple it is compared to the analytical engine:

The Difference Engine can in reality […] do nothing but add; and any other process […] can be performed by it only just to the extent in which it is possibly, by judicious mathematical arrangement and artifices, to reduce them to a series of additions.

3: Incidentally, much of the underlying math was developed before Babbage by J. H. Müller.

4: I wonder if it’s possible to automate some of the manual work involved in operating the engine using the engine itself, like calculating initial values.

## Tables

I have a thing for mechanical calculators and it recently occurred to me that I knew almost nothing about two of the most famous ones: Babbage’s difference engine and analytical engine. This led me to read some of the papers from the mid 1800s that were written about them. This blog post is the first of a few I’m planning to write about that.

The analytical engine usually gets most of the attention but the difference engine is an interesting invention in its own right. Not only did it solve an important problem, it is the only one of the two that was complete enough to actually be built. This post about what made the difference engine so important that Babbage spent decades trying to build it and why British government was willing to pay the bill of over ₤17,000, more than the price of two warships.

## Calculation

Today computation is cheap. Extremely cheap. Imagine the amount of math that goes into just displaying the image on your screen right now: the layouts, colors, and fonts, rendering it all on a physical display, and doing it again and again quickly and smoothly enough that you don’t even notice it’s happening.

Computation is so cheap that it’s easy to forget how expensive it was before electronic calculators. It used to be that if you wanted to add two numbers together you had to actually add those numbers together. Manually. Need to multiply or divide two numbers, even just a few digits? Then you’ll have to get the paper out and do long multiplication or long division. I just did a long multiplication to make the image on the right here. I got it wrong twice before getting it right and I went from “this’ll be fun, I wonder if I still remember how to do this” to “god this is so tedious” in about 30 seconds.

And those are just the basic building blocks of doing a calculation. Most interesting computations like calculating interest or the position of the moon in six months require you to do these manual computations over and over and over again. Or require operations that you can’t easily calculate by hand, like trigonometric functions.

At this point you might be thinking: who cares where the moon is in six months? It turns out, back in those days a lot of people did. In some cases people’s lives depended on it.

## Lunar Navigation

On the right here is a table of distances in degrees on the night sky from the center of the moon to various stars at particular times. The first line gives the distance between the center of the moon and Aldebaran on March 3, 1775 at noon, 3, 6, and 9 o’clock. Multiply that by 365 days, then multiply it by a dozen stars, that gives you just some of the tables in this book, the first edition of the Nautical Almanc and Astronomical Ephemeris from 1774, published from the Royal Greenwich Observatory. The audience for the almanac were mariners. The first edition of 10,000 copies sold out immediately.

To determine your longitude at sea you need to know the current time at a fixed point. You can think of it sort of like navigating with time zones. If you know it’s 4 o’clock in the afternoon Greenwich and it’s noon where you are (which you can tell by looking at the sun) then you know you’re in the -4 time zone which is the one that goes through eastern Canada, the eastern Caribbean and central South America. This is a rough analogy but that’s the gist of how it works.

Up until around 1850, before accurate clocks were made that could be carried on long voyages, a reliable way to determine the current time was using lunar distance. The moon and stars in the night sky move as a perfectly predictable clockwork. A given configuration occurs only once, and you can calculate in advance precisely what the sky is going to look like at a later time. And more importantly you can go the other way: given the precise configuration of the sky you can calculate exactly what time it is.

Actually you don’t need the full configuration; all you need to know to calculate the time is the distance in degrees from the center of the moon to any star. That’s where the almanac comes in. It precomputes those distances so that all a navigator needs to do is measure the angle (typically using a sextant) and then look the value up in the almanac. Okay that’s actually just the basic principle, there’s a lot more to it in practice: you have to adjust for the distance from the center of the moon to the circumference, for your position on the earth, for atmospheric refraction, etc. Being a navigator takes a lot of skill. How do you make those adjustments by the way? More tables of course.

All this means that having accurate tables is extremely important. An undetected error in the almanac means a navigation error which can mean shipwreck. This is made worse because many of these tables are time dependent: one line in the almanac is useful on one day only. As a navigator you’re basically beta testing the data for every single day because nobody has had any reason to use the data before.

There are many sources of errors in numerical tables. Teams of human computers carried out the manual calculations, a tedious and error prone process. (Incidentally, it turns out that the better an understand you have of the calculation you’re carrying out the more likely you are to make mistakes as a computer.) Often the same value would be calculated by more than one human computer and then compared to catch errors – but checking is an error prone process in itself, and computers can (and did) copy from each other. Then finally someone has to manually set the values in movable type and print them, also an obvious source of errors.

Enter Charles Babbage.

## Babbage

Babbage was an unorthodox and very gifted mathematician. He was a fan of Leibniz which was still something of a heresy at his college Trinity, home of Newton, Leibniz’s arch rival. He was also one of the founders of the Analytical Society whose goal it was to replace Newton’s formalism for calculus with Leibniz’s. Incidentally, besides inventing calculus independently from Newton Leibniz designed a mechanical calculating machine, the stepped reckoner.

Babbage recognized the problem of calculating tables, as most people did, but also had a solution: the difference engine. The idea behind the difference engine is that most of the functions you want to create tables for can be approximated by a polynomial. Here is the sine function along with three approximating polynomials of increasing degree:

As the degree of the polynomial increases the approximation quickly becomes better – the degree-seven polynomial is quite close:

$f_7(x) = x - \frac{x^3}{3!} + \frac{x^3}{5!} - \frac{x^7}{7!}$

Babbage’s idea was to use mechanical means to calculate the approximating polynomials with high accuracy not just print the result on paper but do the actual typesetting to eliminate even the typographer as a source of errors.

But I’ll stop here before we get to the juicy details of how the difference engine works and save that for my next blog post.

## 0x5f3759df (appendix)

This post is an appendix to my previous post about 0x5f3759df. It probably won’t make sense if you haven’t read that post (and if you’ve already had enough of  0x5f3759df you may consider skipping this one altogether).

One point that bugged me about the derivation in the previous post was that while it explained the underlying mathematics it wasn’t very intuitive. I was basically saying “it works because of, umm, math“. So in this post I’ll go over the most abstract part again (which was actually the core of the argument). This time I’ll explain the intuition behind what’s going on. And there’ll be some more graphs. Lots of them.

The thing is, there is a perfectly intuitive explanation for why this works. The key is looking more closely at the float-to-int bit-cast operation. Let’s take that function,

int float_to_int(float f) {
return *(int*)&f;
}

and draw it as a graph:

This is a graph of the floating-point numbers from around 1 to 10 bit-cast to integers. These are very large integers (L is 223 so 128L is 1 073 741 824). The function has visible breaks at powers of two and is a straight line inbetween. This makes sense intuitively if you look at how you decode a floating-point value:

$v = (1+m)2^e$

The mantissa increases linearly from 0 to 1; those are the straight lines. Once the mantissa reaches 1 the exponent is bumped by 1 and the mantissa starts over from 0. Now it’ll be counting twice as fast since the increased exponent gives you another factor of 2 multiplied onto the value. So after a break the straight line covers twice as many numbers on the x axis as before.

Doubling the distance covered on the x axis every time you move a fixed amount on the y axis – that sounds an awful lot like log2. And if you plot float_to_int and log2 (scaled appropriately) together it’s clear that one is an approximation of the other.

In other words we have that

$\mathtt{float\_to\_int}(v) \approx (\log_2(v)+B)L$

This approximation is just a slightly different way to view the linear approximation we used in the previous post. You’ll notice that the log2 graph overshoots float_to_int in most places, they only touch at powers of 2. For better accuracy we’ll want to move it down a tiny bit – you’ll already be familiar with this adjustment: it’s σ. But I’ll leave σ out of the following since it’s not strictly required, it only improves accuracy.

You can infer the approximation formula from the above formula alone but I’ll do it slightly differently to make it easier to illustrate what’s going on.

The blue line in the graph above,  float_to_int(v), is the i in the original code and the input to our calculation. We want to do integer arithmetic on that value such we get an output that is approximately

$\mathtt{float\_to\_int}(v^{-\frac 12})$

the integer value that, if converted back to a float, gives us roughly the inverse square root of x. First, let’s plot the input and the desired output in a graph together:

The desired output is decreasing because it’s a reciprocal so the larger the input gets the smaller the output. They intersect at v=1 because the inverse square root of 1 is 1.

The first thing we’ll do to the input is to multiply it by -1 to make it decrease too:

Where before we had a very large positive value we now have is a very large negative one. So even though they look like they intersect they’re actually very far apart and on different axes: the left one for the desired output, the right one for our work-in-progress approximation.

The new value is decreasing but it’s doing it at a faster rate than the desired output. Twice as fast in fact. To match the rates we multiply it by 1/2:

Now the two graphs are decreasing at the same rate but they’re still very far apart: the target is still a very large positive value and the current approximation is a half as large negative value. So as the last step we’ll add 1.5LB to the approximation, 0.5LB to cancel out the -1/2 we multiplied it with a moment ago and 1LB to bring it up to the level of the target:

And there it is: a bit of simple arithmetic later and we now have a value that matches the desired output:

$\mathtt{float\_to\_int}(v^{-\frac 12}) \approx \frac 32LB - \frac 12 \mathtt{float\_to\_int}(v)$

The concrete value of the constant is

$\frac 32LB = \mathtt{0x5f400000}$

Again the approximation overshoots the target slightly the same way the log2 approximation did before. To improve accuracy you can subtract a small σ from B. That will give a slightly lower constant – for instance 0x5f3759df.

And with that I think it’s time to move onto something else. I promise that’s the last you’ll hear from me on the subject of that constant.

## 0x5f3759df

This post is about the magic constant 0x5f3759df and an extremely neat hack, fast inverse square root, which is where the constant comes from.

Meet the inverse square root hack:

float FastInvSqrt(float x) {
float xhalf = 0.5f * x;
int i = *(int*)&x;         // evil floating point bit level hacking
i = 0x5f3759df - (i >> 1);  // what the fuck?
x = *(float*)&i;
x = x*(1.5f-(xhalf*x*x));
return x;
}

What this code does is calculate, quickly, a good approximation for

$\frac{1}{\sqrt{x}}$

It’s a fairly well-known function these days and first became so when it appeared in the source of Quake III Arena in 2005. It was originally attributed to John Carmack but turned out to have a long history before Quake going back through SGI and 3dfx to Ardent Computer in the mid 80s to the original author Greg Walsh. The concrete code above is an adapted version of the Quake code (that’s where the comments are from).

This post has a bit of fun with this hack. It describes how it works, how to generalize it to any power between -1 and 1, and sheds some new light on the math involved.

(It does contain a fair bit of math. You can think of the equations as notes – you don’t have to read them to get the gist of the post but you should if you want the full story and/or verify for yourself that what I’m saying is correct).

## Why?

Why do you need to calculate the inverse of the square root – and need it so much that it’s worth implementing a crazy hack to make it fast? Because it’s part of a calculation you do all the time in 3D programming. In 3D graphics you use surface normals, 3-coordinate vectors of length 1, to express lighting and reflection. You use a lot of surface normals. And calculating them involves normalizing a lot of vectors. How do you normalize a vector? You find the length of the vector and then divide each of the coordinates with it. That is, you multiply each coordinate with

$\frac{1}{\sqrt{x^2+y^2+z^2}}$

Calculating $x^2+y^2+z^2$ is relatively cheap. Finding the square root and dividing by it is expensive. Enter FastInvSqrt.

## What does it do?

What does the function actually do to calculate its result? It has 4 main steps. First it reinterprets the bits of the floating-point input as an integer.

int i = *(int*)&x;         // evil floating point bit level hack

It takes the resulting value and does integer arithmetic on it which produces an approximation of the value we’re looking for:

i = 0x5f3759df - (i >> 1);  // what the fuck?

The result is not the approximation itself though, it is an integer which happens to be, if you reinterpret the bits as a floating point number, the approximation. So the code does the reverse of the conversion in step 1 to get back to floating point:

x = *(float*)&i;

And finally it runs a single iteration of Newton’s method to improve the approximation.

x = x*(1.5f-(xhalf*x*x));

This gives you an excellent approximation of the inverse square root of x. The last part, running Newton’s method, is relatively straightforward so I won’t spend more time on it. The key step is step 2: doing arithmetic on the raw floating-point number cast to an integer and getting a meaningful result back. That’s the part I’ll focus on.

## What the fuck?

This section explains the math behind step 2. (The first part of the derivation below, up to the point of calculating the value of the constant, appears to have first been found by McEniry).

Before we can get to the juicy part I’ll just quickly run over how standard floating-point numbers are encoded. I’ll just go through the parts I need, for the full background wikipedia is your friend. A floating-point number has three parts: the sign, the exponent, and the mantissa. Here’s the bits of a single-precision (32-bit) one:

s e e e e e e e e m m m m m m m m m m m m m m m m m m m m m m m

The sign is the top bit, the exponent is the next 8 and the mantissa bottom 23. Since we’re going to be calculating the square root which is only defined for positive values I’m going to be assuming the sign is 0 from now on.

When viewing a floating-point number as just a bunch of bits the exponent and mantissa are just plain positive integers, nothing special about them. Let’s call them E and M (since we’ll be using them a lot). On the other hand, when we interpret the bits as a floating-point value we’ll view the mantissa as a value between 0 and 1, so all 0s means 0 and all 1s is a value very close to but slightly less than 1. And rather than use the exponent as a 8-bit unsigned integer we’ll subtract a bias, B, to make it a signed integer between -127 and 128. Let’s call the floating-point interpretation of those values e and m. In general I’ll follow McEniry and use upper-case letters for values that relate to the integer view and and lower-case for values that relate to the floating-point view.

Converting between the two views is straightforward:

$m = \frac{M}{L}$

$e = E - B$

For 32-bit floats L is 223 and B is 127. Given the values of e and m you calculate the floating-point number’s value like this:

$(1+m)2^e$

and the value of the corresponding integer interpretation of the number is

$M + LE$

Now we have almost all the bits and pieces I need to explain the hack so I’ll get started and we’ll pick up the last few pieces along the way. The value we want to calculate, given some input x, is the inverse square root or

$y = \frac{1}{\sqrt{x}} = x^{-\frac 12}$

For reasons that will soon become clear we’ll start off by taking the base 2 logarithm on both sides:

$\log_2 y = {-\frac 12}\log_2 x$

Since the values we’re working with are actually floating-point we can replace x and y with their floating-point components:

$\log_2 (1+m_y) + e_y = {-\frac 12}(\log_2 (1+m_x) + e_x)$

Ugh, logarithms. They’re such a hassle. Luckily we can get rid of them quite easily but first we’ll have to take a short break and talk about how they work.

On both sides of this equation we have a term that looks like this,

$\log_2(1 + v)$

where v is between 0 and 1. It just so happens that for v between 0 and 1, this function is pretty close to a straight line:

Or, in equation form:

$\log_2(1 + v) \approx v + \sigma$

Where σ is a constant we choose. It’s not a perfect match but we can adjust σ to make it pretty close. Using this we can turn the exact equation above that involved logarithms into an approximate one that is linear, which is much easier to work with:

$m_y + \sigma + e_y \approx {-\frac 12}(m_x + \sigma + e_x)$

Now we’re getting somewhere! At this point it’s convenient to stop working with the floating-point representation and use the definitions above to substitute the integer view of the exponent and mantissa:

$\frac{M_y}{L} + \sigma + E_y - B \approx {-\frac 12}(\frac{M_x}{L} + \sigma + E_x - B)$

If we shuffle these terms around a few steps we’ll get something that looks very familiar (the details are tedious, feel free to skip):

$\frac{M_y}{L} + E_y \approx {-\frac 12}(\frac{M_x}{L} + \sigma + E_x - B) - \sigma + B$

$\frac{M_y}{L} + E_y \approx {-\frac 12}(\frac{M_x}{L} + E_x) - \frac{3}{2}(\sigma - B)$

$M_y + LE_y \approx {\frac 32}L(B - \sigma) - {\frac 12}(M_x + LE_x)$

After this last step something interesting has happened: among the clutter we now have the value of the integer representations on either side of the equation:

$\mathbf{I_y} \approx {\frac 32}L(B - \sigma) - {\frac 12}\mathbf{I_x}$

In other words the integer representation of y is some constant minus half the integer representation of x. Or, in C code:

i = K - (i >> 1);

for some K. Looks very familiar right?

Now what remains is to find the constant. We already know what B and L are but we don’t have σ yet. Remember, σ is the adjustment we used to get the best approximation of the logarithm, so we have some freedom in picking it. I’ll pick the one that was used to produce the original implementation, 0.0450465. Using this value you get:

${\frac 32}L(B - \sigma) = {\frac 32}2^{23}(127 - 0.0450465) = 1597463007$

Want to guess what the hex representation of that value is? 0x5f3759df. (As it should be of course, since I picked σ to get that value.) So the constant is not a bit pattern as you might think from the fact that it’s written in hex, it’s the result of a normal calculation rounded to an integer.

But as Knuth would say: so far we’ve only proven that this should work, we haven’t tested it. To give a sense for how accurate the approximation is here is a plot of it along with the accurate inverse square root:

This is for values between 1 and 100. It’s pretty spot on right? And it should be – it’s not just magic, as the derivation above shows, it’s a computation that just happens to use the somewhat exotic but completely well-defined and meaningful operation of bit-casting between float and integer.

## But wait there’s more!

Looking at the derivation of this operation tells you something more than just the value of the constant though. You will notice that the derivation hardly depends on the concrete value of any of the terms – they’re just constants that get shuffled around. This means that if we change them the derivation still holds.

First off, the calculation doesn’t care what L and B are. They’re given by the floating-point representation. This means that we can do the same trick for 64- and 128-bit floating-point numbers if we want, all we have to do is recalculate the constant which it the only part that depends on them.

Secondly it doesn’t care which value we pick for σ. The σ that minimizes the difference between the logarithm and x+σ may not, and indeed does not, give the most accurate approximation. That’s a combination of floating-point rounding and because of the Newton step. Picking σ is an interesting subject in itself and is covered by McEniry and Lomont.

Finally, it doesn’t depend on -1/2. That is, the exponent here happens to be -1/2 but the derivation works just as well for any other exponent between -1 and 1. If we call the exponent (because e is taken) and do the same derivation with that instead of -1/2 we get:

$\mathbf{I_y} \approx (1 - p)L(B - \sigma) + p\mathbf{I_x}$

Let’s try a few exponents. First off p=0.5, the normal non-inverse square root:

$\mathbf{I_y} \approx K_{\frac 12} + {\frac 12}\mathbf{I_x}$

$K_{\frac 12} = {\frac 12}L(B - \sigma) = {\frac 12}2^{23}(127 - 0.0450465) = \mathtt{0x1fbd1df5}$

or in code form,

i = 0x1fbd1df5 + (i >> 1);

Does this work too? Sure does:

This may be a well-known method to approximate the square root but a cursory google and wikipedia search didn’t suggest that it was.

It even works with “odd” powers, like the cube root

$\mathbf{I_y} \approx K_{\frac 13} + {\frac 13}\mathbf{I_x}$

$K_{\frac 13} = {\frac 23}L(B - \sigma) = {\frac 23}2^{23}(127 - 0.0450465) = \mathtt{0x2a517d3c}$

which corresponds to:

i = (int) (0x2a517d3c + (0.333f * i));

Since this is an odd factor we can’t use shift instead of multiplication. Again the approximation is very close:

At this point you may have noticed that when changing the exponent we’re actually doing something pretty simple: just adjusting the constant by a linear factor and changing the factor that is multiplied onto the integer representation of the input. These are not expensive operations so it’s feasible to do them at runtime rather than pre-compute them. If we pre-multiply just the two other factors:

$L(B - \sigma) = 2^{23}(127 - 0.0450465) = \mathtt{0x3f7a3bea}$

we can calculate the value without knowing the exponent in advance:

i = (1 - p) * 0x3f7a3bea + (p * i);

If you shuffle the terms around a bit you can even save one of multiplications:

i = 0x3f7a3bea + p * (i - 0x3f7a3bea);

This gives you the “magic” part of fast exponentiation for any exponent between -1 and 1; the one piece we now need to get a fast exponentiation function that works for all exponents and is as accurate as the original inverse square root function is to generalize the Newton approximation step. I haven’t looked into that so that’s for another blog post (most likely for someone other than me).

The expression above contains a new “magical” constant,  0x3f7a3bea. But even if it’s in some sense “more magical” than the original constant, since that can be derived from it, it depends on an arbitrary choice of σ so it’s not universal in any way. I’ll call it Cσ and we’ll take a closer look at it in a second.

But first, one sanity check we can try with this formula is when p=0. For a p of zero the result should always be 1 since x0 is 1 independent of x. And indeed the second term falls away because it is multiplied by 0 and so we get simply:

i = 0x3f7a3bea;

Which is indeed constant – and interpreted as a floating-point value it’s 0.977477 also known as “almost 1” so the sanity check checks out. That tells us something else too: Cσ actually has a meaningful value when cast to a float. It’s 1; or very close to it.

That’s interesting. Let’s take a closer look. The integer representation of Cσ is

$C_\sigma = L(B - \sigma) = LB - L\sigma$

This is almost but not quite the shape of a floating-point number, the only problem is that we’re subtracting rather than adding the second term. That’s easy to fix though:

$LB - L\sigma = LB - L + L - L\sigma = L(B - 1) + L(1 - \sigma)$

Now it looks exactly like the integer representation of a floating-point number. To see which we’ll first determine the exponent and mantissa and then calculate the value, cσ. This is the exponent:

$e_{c_\sigma} = (E_{C_\sigma} - B) = (B - 1 - B) = -1$

and this is the mantissa:

$m_{c_\sigma} = \frac{M_{C_\sigma}}{L} = \frac{L(1 - \sigma)}{L} = 1 - \sigma$

So the floating-point value of the constant is (drumroll):

$c_\sigma = (1 + m_{c_\sigma})2^{e_{c_\sigma}} = \frac{1 + 1 - \sigma}2 = 1 - \frac{\sigma}2$

And indeed if you divide our original σ from earlier, 0.0450465, by 2 you get 0.02252325; subtract it from 1 you get 0.97747675 or our friend “almost 1” from a moment ago. That gives us a second way to view Cσ, as the integer representation of a floating-point number, and to calculate it in code:

float sigma = 0.0450465;
float c_sigma = 1 - (0.5f * sigma);
int C_sigma = *(*int)&c_sigma;

Note that for a fixed σ these are all constants and the compiler should be able to optimize this whole computation away. The result is 0x3f7a3beb – not exactly 0x3f7a3bea from before but just one bit away (the least significant one) which is to be expected for computations that involve floating-point numbers. Getting to the original constant, the title of this post, is a matter of multiplying the result by 1.5.

With that we’ve gotten close enough to the bottom to satisfy at least me that there is nothing magical going on here. For me the main lesson from this exercise is that bit-casting between integers and floats is not just a meaningless operation, it’s an exotic but very cheap numeric operation that can be useful in computations. And I expect there’s more uses of it out there waiting to be discovered.

(Update: I’ve added an appendix about the intuition behind the part where I shuffle around the floating-point terms and the integer terms appear.)

## Reload?

This post is about one of those messages you see every day of your life as a programmer, your editor telling telling you: this file has changed, would you like to reload or keep your changes? The problem with this particular message is that usually neither option is what you want. You’ve made changes to that file for a reason and you neither want to lose those changes or go out of sync with the file system. There’s a lot of new editors about these days (which is fantastic) but I’ve yet to see one make an attempt to solve this problem. This blog post gives some thoughts on how it could be solved.

## The problem

The underlying issue here is that editors assume they have exclusive access to all files you are editing. Some editors including eclipse and emacs consider you to be editing a file as soon as you’ve opened it, before you’ve even made any changes. And once you have unsaved changes I don’t know of any editors that are forgiving if the file changes on disk in the meantime.

This assumption, that an editor can have exclusive access to a file, doesn’t seem sound. Just because you have a file open in an editor doesn’t mean it can’t change. If you’re working on a busy code base there could be people editing the same files as you at any time. Usually this is controlled through version control so you have control over when you sync and can make sure you don’t have unsaved changes when you do, but this is something you have to manage yourself. You get no help from your editor. Worse is if you’re working on something time critical, like configuration files that control a live application. These kinds of files can be very busy and you’ll want to both stay in sync and be able to queue up and submit changes quickly. In both cases, but especially the latter, the revert/override choice is really unhelpful.

And these problems is normal old-fashioned use of source code. Besides the emergence of new editors people are also busy working on web-based IDEs. This further undermines the assumption that an editor can own a file. As soon as you’re on the web it becomes much easier to share and collaborate. It may be desirable to support more fine-grained real-time collaboration in addition to the more coarse-grained version control model we’re used to, where modifications come in larger self-contained changesets. Real-time collaboration is not appropriate in many cases, but there are cases where it makes sense and it definitely shouldn’t be the capabilities of the editor that decides whether it’s feasible. If you want to use it the editor should support it, on the desktop or on the web.

## Living source files

The good news is that this issue has long been solved in the context of collaborative online tools like google docs and google wave (I know there are loads of others but these are the ones I know best). These kinds of tools can’t assume that you have exclusive access to the document you’re editing, there can be any number of entities modifying a document at the same time. They also can’t assume that what you’re seeing is the current contents of the document. In fact in this setting the concept of an authoritative “current document” isn’t even well defined – as long as each participant keeps editing and their changes take a while to propagate the participants may all see slightly different versions of the document, none of which is more authoritative than the others. Only once everyone has stopped editing and everyone has caught up to each others changes does the document stabilize and you can talk about authoritative current contents of the file.

This model may sound exotic and not appropriate for editing local files, and it may not be. But the underlying assumptions of those models, that what you’re seeing may not reflect the current contents of the file and that those contents may change at any time, are good assumptions to make even for local file editors. The next part is about how to make an editor work under those assumptions.

## OT

How do online collaborative tools work then? The ones I’m familiar with use operational transformation, or OT. The way OT works is by changing the focus from the question what are the contents of this file? to what operations have been performed on the contents of this file? This slight change in viewpoint makes a huge difference.

In a (very simplified) non-OT model your interaction with a source file in an editor works something like this:

• When a file is opened, read the current contents into a buffer.
• Any changes the user makes are applied directly (destructively) to the buffer.
• When user asks for the file to be saved the current contents of the buffer are written directly into the file. If the file has changed those changes would be lost so throw up a revert/override dialog and let the user choose.

In the OT model the interaction would work something along these lines. (This is a very rough outline, a proper OT model would be much cleaner than this makes it sound):

• When a file is opened, read the current contents into a buffer.
• Any changes the user makes are stored as operations, essentially a diff against the original file contents.
• When the user asks the editor to save the file it does the following:
• Read the current contents of the file.
• Generate a three-way diff between the original contents, the current contents, and the pending operations.
• Apply the resulting diff to the current file contents.
• Save the results.

While this isn’t completely accurate it gives a sense of the basics: what you’re doing is not editing a file, though it looks just like it, you’re building a set of changes against the file contents from one point in time. The changes you build may be applied to the original file contents, or they may be applied to a different version if the file changes. This model is much more robust to the way people actually work with source code. And though you may not always want your editor to merge changes on the fly when you save wouldn’t it be nice if it would at least give you the option?

Beyond just saving this model has several other nice properties. When the system notices that a file has changed, even before you want to save it, it can update your buffer by doing essentially the same as when saving except that instead of applying your pending changes to the current file contents it would apply the changes between the old and current file contents to the buffer containing your pending changes. It also allows you to undo saving since saving is equivalent to applying a diff and you can undo that, even when there’s been subsequent changes, just by reverting the diff.

There are probably numerous other solutions to this issue. I’m most familiar with OT because I worked with when I was on the wave project and I know it works extremely well. (I wish I had a good place to point anyone interested in learning the basics but I don’t, only the wikipedia page listed above). My main point is that this issue should be solved somehow, and it’s a problem that’s been solved in other contexts. I think it’s about time those solutions make their way to desktop editors.

## Stress

Came across an article about information overload on google+, which was very apropos of my recent efforts to not be so stressed at work. I’ve made some changes to how I work over the last few months, some of them just tweaks actually, and I feel like they’ve improved my ability to focus and made work much less stressful. YMMV, obviously.

• I use virtual desktops to minimize distraction. I have three virtual desktops: one desktop with just what I’m working on right now, nothing else, one with email, code reviews, and other things that might distract me, and one with personal stuff like personal email. I can easily switch desktops and check emails, but the only thing that can impose on me are chat popups, and people don’t chat me a lot. I occasionally check for new important email or code reviews but I can choose when, typically while I’m waiting for something on my work desktop to complete.
• I keep the tasks I’m working on written down and categorized according to when I want to complete them. I pick one task from the stuff-I-should-do-soon pile and work just on that. When that pile is empty, or it’s been a while since the last time, I look over all my tasks, usually a quarter’s worth, and move some out of and some into the soon-pile. This means I don’t have to worry about all the things I’m not working on right now because I know I’ve considered what I could be working on and chosen what I’m doing right now as the most important. This has been very helpful in making me less stressed. This is apparently a well-known approach, I first saw it in this video about how to use Asana (Using Asana for Individual Task Management).
• I make an effort to only work 8 hours a day. This is a tough one because it sometimes makes me feel like I’m slacking off, especially because I come in early so I’m almost always the first to go home. However, there’s plenty of research suggesting that even though you may feel you’re getting more work done working 10 hours a day it’s an illusion and you’re actually doing less than if you worked 8 hours (see why crunch mode doesn’t work). At first I just trusted the research, against my own intuition, but at this point I’m convinced — I feel much more focused and efficient and I see no evidence that I’m being less productive. I will occasionally work more for a while but I always feel a “overwork hangover” afterwards where I’m tired, less focused, more likely to space out. But at this point I know that and can make a conscious choice about whether it’s worth it.

Some of these require a bit of motivation; the one really easy thing is to clear any distractions from your workspace, especially if like me you were silly enough to have a browser window there. It’s easy and it makes a big difference.

## Monitoring

One really useful programming tool I use whenever it’s available is exported stats monitoring. It’s one of those things that has no standard definition and which people reimplement over and over in different ad-hoc ways. This post is about a monitoring tool for JavaScript programs running in chrome I implemented recently, WebMon.

## WebMon

WebMon is a very simple tool conceptually; this image tells you all you need to know to use it:

It has two parts: a JavaScript library, webmon.js, and a chrome extension, WebMon. In your JavaScript code you export a stat by creating an instance of the webmon.Counter type which is provided by the library. During execution you update the stats as appropriate using the methods on the Counter instances. The WebMon chrome extension detects any page that uses the library and can show you a popup that shows the current value of each exported stat, updated continuously. You can also specify some simple computations on the data, for instance here the frame rate counter is configured to display the rate per second rather than the count. To see this in action try installing the chrome extension and going to this test page. The extension needs pretty broad permissions to see stats on any web page so for the paranoid you may want to skim the source code before installing it.

The motivation for implementing this came from a WebGL-based hobby project I’m working on. I needed to keep track of the frame rate and rendering CPU load and my initial implementation which displayed those in a div on the page was just not working out. WebMon gives you the same functionality with a minimum of complexity and performance impact on the JavaScript side since almost all the code is in the chrome extension. It also doesn’t require you to clutter your page with debug info.

That’s all there is to it. The rest of this post will be about exported stats in general.

## Monitoring

WebMon is just one example of the more general mechanism of exported stats. The general model is: a program exports a set of stats which it updates during execution. These stats are visible to some form of external processor/viewer which can record or at least see their value at any time. A simple example is chrome. If you run chrome with the --enable-stats-table command-line flag you can go to chrome://stats and see the current value of a number of chrome’s internal counters and timers. V8 has a similar mechanism but since v8 doesn’t have a UI that can easily display the stats I wrote a python script that continuously reads the counter values and displays them in a window. I used it all the time when working on the v8/chrome integration layer since it allowed me to see exactly how many dom node wrappers were live, exactly what had happened during the last garbage collection, etc.

Much of this you could also get by printing debug information and processing it either manually or, for more complex output, using separate programs. A place this works really well is dumping debug information after garbage collections in a virtual machine. Garbage collections happen rarely enough that it’s not a lot of overhead and you’re likely to want information about every collection, not just the last one. Compared to logging the advantage of using an exported stat is that updating it is cheaper — in the v8 implementation it’s just writing a single word in memory — and it scales well. If you log information about 100 different conditions your log becomes huge and difficult to process, and the I/O slows your program down. With exported stats the external processor or viewer can easily filter the stats for you and only show the ones you’re interested in. Also, the space for each export stat is constant, typically just a single word containing the stat and another few words of metainformation. Another advantage of exported stats is that they’re testable — you can easily read back and test the value of a counter whereas testing that the right thing is printed in a debug log is tricky.

Really I think each programming language should come with stats exporting and monitoring built in or as a canonical library. Or even better, since monitoring and viewing can (and should)  be a separate process there could be a single common monitoring tool with a standard protocol that would have bindings for each language.

Until that happens, if you’d like to know what is happening in your JavaScript program as it happens you should give WebMon a try.

## Gray

It turns out that Google’s generous supply of free snacks to employees does more than just keep us fat and happy, my efforts to keep my snack intake down led to some interesting math too. In this post I’ll have a bit of fun with Gray code, prove wikipedia wrong (shock!), and get healthier by eating fewer snacks.

It recently occurred to me that my snack counting system was in unary: one piece of sticky tack for one snack. As a self-respecting computer scientist obviously I find this unacceptable – unary is okay for some very limited uses but binary should really always be your first choice. Clearly I had to change to a binary system.

The problem with standard binary numbers though is that I don’t want to have to move to many dots around. It sucks when I’m having my fourth snack of the week that I have to move all 3 dots to go from 011 to 100. Luckily there is a solution to this: Gray code, which allows you to count as efficiently as normal binary but only flips one bit at a time.

To use Gray code though I needed an easy way to count from one number to the next. It’s fine to know that I only need to flip one bit but I also need to know which one to flip. Skimming the wikipedia page I found this recipe for counting:

[…] at step i find the bit position of the least significant ‘1’ in the binary representation of i – flip the bit at that position in the previous code to get the next code.

This is actually incorrect (more on that below) but I didn’t know that when I first read it. I just though “that sounds like a lot of work, surely it doesn’t have to be that difficult”. And it doesn’t if you look at how Gray code1 is constructed.

Gray code is constructed recursively. Given the sequence of Gray codes that count from 0 to n, for instance here are the values that count from 0 to 3:

00 01 11 10

you construct the set of numbers counting from 0 to 2n by adding a zero in front of the previous sequence, using that to count up to n, and replacing the 0 with a 1 and counting the previous sequence backwards to 0:

000 001 011 010 110 111 101 100

This is a really neat construction and makes it easy to prove inductively that they have some of their nice properties: that you always change exactly one bit, that you never hit the same value twice, etc.

This construction also means you can increment a Gray coded number recursively, using this procedure:

1. If the first digit is 0 and the remaining digits can be incremented, increment them.
2. If the first digit is 0 and the remaining digits are already at their largest possible value, flip the first digit.
3. If the first digit is 1 decrement the remaining digits.

The procedure for decrementing is the same, just inversed. This may sound complicated but can be phrased differently in a way that makes it easier to use in practice.

• You want to flip exactly one bit. Initially you want to flip a bit up, that is, from 0 to 1.
• You start from the far left, going right, and you must flip the least significant bit that can be flipped the way you want. You can’t flip a 0-bit down since it’s already down, and you can’t flip a 1-bit up.
• When you meet a 1 you change your mind about which way you want to flip, starting at the next bit. If you wanted to flip up before, going forward you’ll want to flip down and vice versa. But only from the next bit.

I call this the flip-flop2 algorithm because you keep changing your mind about which way you want to flip the bit. Here’s an example of how it works.  Say you want to increment this Gray coded number (it’s happens to be 115 but the concrete value is irrelevant):

01001010

You start out wanting to flip a bit up. The first bit is 0 so you can up it, which makes it a candidate for the bit to flip. We’ll mark it like this:

01001010

The arrow goes up because we want to flip the bit up, and it’s green because it’s possible to flip it. Moving left, we also want to flip the next bit up but it’s already a 1 so we can’t and we make that one red.

01001010

Moving left again, since the last bit we saw was a 1 we now change our minds and want to flip a bit down. The next two bits are both 0 so we can’t down those and mark them red too:

01001010
↓↓

The next bit is a 1 which we can down,

01001010
↓↓

and as usual we change our mind and now want to up. The next bit is a 0 which we can up

01001010
↓↓

And so on: the next bit we can’t up and the last bit we can’t down so both become red. The least significant bit we could flip the way we’d want is the third so that’s the one we’ll flip. The end result is this:

01001010 → 01001110
↓↓

In practice it’s easy to do. You just scan the number from left to right, keeping track of operation you currently want to do and what the last position was where you could do an operation. Here’s a sequence of the Gray numbers from 115 to 122 with the procedure arrows like above; the bold digit is the one that was changed in the previous step:

01001010 → 01001110 → 01001111 → 0100110
↓↓↓   ↓↓   ↓↓   ↓↓

01001100 → 01000100 → 01000101 → 01000111
↓↓   ↓↓   ↓↓   ↓↓

Besides being a neat way to increment a Gray code the flip-flip algorithm has a few interesting quirks that are worth taking a closer look at. One thing you may have noticed is that each arrow in the diagrams have two binary properties: they each have a direction, up or down, and they each have a color, red or green. If you read these properties as bits they form two different numbers: the direction value where up is 0 and down is 1, and the color value where green is 0 and red is 1. Here is an example:

01001010       Gray code
↓↓
01110011  115  Color value
00111001  57   Direction value

If you think the color value, 115, looks familiar that’s because is is: it’s the binary representation of the Gray code. I found that pretty surprising when I noticed it. If you look at the color and direction values next to each other it’s also clear that the direction is actually the color right-shifted once. Here are the the color and direction values spelled out for the sequence above:

01001010 → 01001110 → 01001111 → 0100110
↓↓↓   ↓↓   ↓↓   ↓↓
01110011   01110100   01110101   01110110
00111001   00111010   00111010   00111011

01001100 → 01000100 → 01000101 → 01000111
↓↓   ↓↓   ↓↓   ↓↓
01110111   01111000   01111001   01111010
00111011   00111100   00111100   00111101

The intuitive reason for this becomes clearer if you interpret the procedure slightly differently. In the original phrasing you want to either change a 0 to 1 or a 1 to 0, and you keep changing your mind about which you want to do. An alternative view is to say that you always want to make a 0 into a 1, but you change your mind about what is 0 and what is 1. After a 1 you now interpret a 0 in the input as a 1, and vice versa. That way, in the cases where you can make the change you want, that is where the arrow is green, the bit must have been what you considered 0. And in the cases where the arrow is red the bit must have been what you considered 1 at that point.

The view of flipping the interpretation of bits actually matches the way Gray code is constructed. To see this consider how you increment from 7 to 8 in Gray and standard binary. With standard binary you go from 0111 to 1000, flipping all four bits. In Gray you go from 0100 to 1100 which is in some sense equivalent except that when you flip the highest bit you don’t have to explicitly also flip the lower bits because they are changed implicitly by changing your interpretation of 0 and 1. Viewing the numbers and the algorithm this way makes it easier to understand intuitively that the bits where you can make the change you want must correspond to a 0 in the binary representation and the bits where you can’t must correspond to a 1. Under this view the direction value tracks your view: if a direction bit is 0 you view the bits normally, if it is 1 you view them in inverse.

But wait, there’s more. The simplest way to Gray encode a binary number n is using xor:

int gray(int n) {
return n ^ (n >> 1);
}

As we saw before the color value above is n and the direction is n right-shifted once. So in fact if you xor together the color and direction values you get the value you started with. And if you check with the examples above that is indeed the case. Intuitively, since the direction value determined how view view the bits, if there is a 0 in the direction value the corresponding bit in the Gray value is the same as in the binary value. If the direction value has a 1 you have to flip the bit in the Gray code to get the binary value. This is exactly how xor works, so xoring the direction value and the Gray value will give you the binary value. And, since xor is associative, xoring the direction and the binary value gives you the Gray value.

There is one point left to revisit which is the recipe given in the wikipedia article for incrementing a Grey code. Here it is again:

[…] at step i find the bit position of the least significant ‘1’ in the binary representation of i – flip the bit at that position in the previous code to get the next code.

We can now tie this together with the flip-flop algorithm. Since the standard representation of i is the same as the color value, what this is saying is: find the least significant 1-bit in the color value, which is the least significant red arrow, and flip it. This seems wrong – surely you’re supposed to flip the least significant green arrow? Let’s try their algorithm.

Take a random number, say the Gray code for 4, 110. The binary representation of 4 is 100. The least significant (and only) 1 is in the most significant bit position. Flipping that in the Gray code gives you 010. This is actually the code for 3. So what seemed wrong intuitively, flipping the red arrow, was indeed wrong.

Instead what the flip-flop algorithm suggests is to flip the least significant green arrow, or 0-bit in the standard representation. Let’s try that. The least significant 0 in 100 is at the least significant bit position. Flipping that in the Gray code gives 111, which is indeed the code for 5. So it works in the example, and it works in general. Someone should probably fix that wikipedia article.

1: Technically there’s more than one encoding that can be called Gray code but people usually mean binary-reflected Gray code and that’s what I mean too.

2: I have no doubt this algorithm is well known so the fact that I’m naming it doesn’t mean I’m claiming to have invented something new – it’s just convenient to have a name for it

## Tracing Promises

Ever so often I find myself having to write JavaScript programs that interact with the world through XMLHttpRequests. And every time the first thing I do before I actually get to send off requests is implement a small promise framework. Promises are just so much more powerful than callbacks that at this point I wouldn’t even consider writing raw callback-based asynchronous code.

One thing about promises though, especially if you program is just moderately complex, is that they make it much more difficult to follow the flow of your code. Everything happens asynchronously in short callbacks and when an error occurs it can be near impossible to piece together what happened. This post is about how to solve that problem.

## Promises in general

Before getting to the part about tracing them here are some examples of uses of promises, to set the stage for the kind of operations we’ll later want to trace. These examples use idioms from my own promise library but these considerations apply to any of the libraries out there, the concrete patterns may just be different.

One place I relied heavily on promises was in the implementation of the sputnik test web runner (which has now been subsumed by test262). The function for firing off an XHR using google’s closure library looked like this:

function makeRequest(path) {  var result = new Promise();  goog.net.XhrIo.send(path, function () {    result.fulfill(this.getResponse());  }, "GET", undefined, undefined, 0);  return result;}// Fetch google.com and alert the result.makeRequest("http://www.google.com")  .onFulfilled(alert);

What this function does is return a promise object that gets fulfilled when the XHR, which is sent off at the same time, completes and invokes its callback. This may look like a whole lot of nothing – basically making a callback into a promise – but the advantage comes at the call site. Say you have two mirrors of the same data and want to send a request to both and use the first response you get back. With promises that is trivial:

// Fetch from both mirrors, use the first response.var response = Promise.select(makeRequest(mirrorOne),    makeRequest(mirrorTwo));response.onFulfilled(alert);

The select operator takes any number of promises and returns a new promise that resolves as soon as any of the child promises resolve. Another use for select to set timeouts:

// Fetch path but time out after one second.var response = Promise.select(makeRequest(path),  Promise.failAfter("Promise timed out", 1000));// Alert successful response, log any errors in the console.response    .onFulfilled(alert)    .onFailed(console.log.bind(console));

Here the response will resolve either to the result of the request, if it happens within one second, or else will fail with the string "Promise timed out" when the result of failAfter fails after one second.

To get the results of several concurrent promises you can use join. This code sends off two XHRs at once, waits until they’re both complete, and then alerts the concatenation of the results.

// Fetch pathOne and pathTwo in parallel.var request = Promise.join(makeRequest(pathOne),    makeRequest(pathTwo));// When done, concatenate the two responses.var response = request.thenApply(function (dataOne, dataTwo) {  return dataOne + dataTwo;});// Finally alert the result.response.onFulfilled(alert);

The join operator takes a list of promises and returns a promise that resolves to the list of values of all the child promises once they’re been resolved. The thenApply operator on a promise returns a new promise whose value will be the result of applying the function to the value of the original promise.

Note that these examples are a lot more verbose than you would usually make such code. I wrote them like that to make it clearer what’s going on. Some operations would be packed into convenience methods, like setting a timeout, and you would generally use method chaining. With that the last example, with the addition of a three second timeout, would look like this:

var response = Promise    .join(makeRequest(pathOne), makeRequest(pathTwo))    .thenApply(function (a, b) { return a + b; })    .withTimeout(3000)    .onFulfilled(alert)    .onFailed(console.log.bind(console));

Packs quite a punch doesn’t it? This is what I mean when I say that promises are more powerful than callbacks; it’s not that one is more expressive than the other, they’re computationally equivalent, but in terms of conciseness and power of abstraction there’s just no competition. And this is just a small part of what you can do with promises. A good place to see the power of this style of programming in action is twitter’s finagle stack.

## The Errors! The Errors!

All’s well with this model as long as all operations succeed. However, as soon as operations start failing this is a mess. Consider for instance the case where you have numerous concurrent operations running that have timeouts. Errors are propagated through promises nicely so if you’ve set up a failure handler you’ll see the error, but it will look something like this:

Operation timed out    at promise.js:108:18

And that’s in the best case where error objects have stack traces. Since each step in the chain of promises is detached and runs in a separate turn of the event loop all you’ll see is that some timeout somewhere fired. There’s no telling which operation took too long. As soon as your program is just moderately complex this becomes a major issue. It’s like with normal synchronous programs if there were no stack traces. Stack traces are there for a reason: it sucks to debug without them.

The stack trace analogy is useful in understanding what information you’d want to have, if you could, when a promise fails. All these patterns of asynchronous operations correspond, when you’re only interested in error reporting, to simple call/return patterns we’re used to. For instance, the join operation is just an asynchronous version of taking a list of functions, calling each one, and returning a list of their results:

function syncJoin(funs) {  var result = [];  for (var i = 0; i ... funs.length; i++)    result.push(funs[i]());  return result;}

In this case if calling any of the functions fails then syncJoin will fail and both calls will show up in the stack trace. The same thing holds for join on promises: if join fails because one of the child promises fail you want information about both. The same thing holds for select, for the same reason.

This all suggests that when a promise fails and propagates that failure through a number of other promises what you as a programmer need is information about each promise in that chain. Something like:

Operation timed out    at promise.js:108:18--- triggering the failure of ---    at Function.withTimeout (promise.js:106:17)    at runDemo (demo.js:226:11)    at demo.html:7:168    at onload (demo.html:8:4)--- triggering the failure of ---    at Function.select (promise.js:50:16)    at Function.withTimeout (promise.js:110:18)    at runDemo (demo.js:226:11)    at demo.html:7:168    at onload (demo.html:8:4)

This is a trace created using the promise trace library I wrote last week to trace down some issues with a chrome extension I was working on. A promise trace is made up of segments, each one a stack trace identifying a promise. The top segment is the error that caused the failure, in this case exactly the same timeout as before. The next one is the promise created by withTimeout that fails when the timeout is exceeded. The bottom segment is the select between the operation and the timeout, and going to demo.js:226 will tell you which operation it was.

The trace above is from chrome; here’s the result of the same failure but in firefox

Operation timed out--- triggering the failure of ---GenericTraceSegment()@trace.js:197Promise()@promise.js:23([object Object],100)@promise.js:106runDemo()@demo.js:226onload([object Event])@demo.html:1--- triggering the failure of ---GenericTraceSegment()@trace.js:197Promise()@file:promise.js:23([object Object],[object Object])@promise.js:50([object Object],100)@promise.js:110runDemo()@demo.js:226onload([object Event])@demo.html:1

It’s a bit more cluttered since the firefox stack trace api doesn’t support stripping irrelevant top frames but both contain the relevant clues you need when debugging.

Both traces contain a fair amount of redundant information. For instance, the two bottom promises are created close to each other so the bottom of their stack traces are identical. To make the traces easier to read the promise trace library folds away overlapping stack traces. The actual trace that would be printed for the above error is

Operation timed out    at promise.js:108:18--- triggering the failure of ---    at Function.withTimeout (promise.js:106:17)    at runDemo (demo.js:226:11)    at demo.html:7:168    at onload (demo.html:8:4)--- triggering the failure of ---    at Function.select (promise.js:50:16)    at Function.withTimeout (promise.js:110:18)    at runDemo (demo.js:226:11)    (... rest of trace same as previous ...)

This is probably still more cluttered and redundant that it needs to be – in my experience for each segment you only need to know the top frame below the call into the promise library and the bottom frame that is different from the previous segment. But I’m not sure that’s true in all cases so for now this is as small as I’d like to make it.

## Getting promise traces

How does this api work? From a user’s perspective all you see is that on failure your callback is given two values: the error and a trace. You’d typically use it something like this:

myOperation.onFailed(function (error, trace) {  console.log(trace.toString());});

This requires some support from the promise library but not a lot actually. Whenever a promise is created it has to create a PromiseTraceSegment value and when propagating a failure it should create and propagate a chain of PromiseTrace objects which collect the relevant segments. In my own promise library the tracing code makes up maybe ten lines.

## Performance

The way this is implemented, as is probably evident from the examples above, is by capturing a stack trace every time a promise is created (there’s also a flag to disable trace capturing altogether). The whole thing is actually pretty straightforward, the whole trace library is ~200 lines of which most are for formatting the output. Capturing stack traces is not super expensive generally but even relatively cheap operations add up if you do them often enough. To get a sense for  how expensive tracing is I wrote a simple ludicrously asynchronous implementation of the Fibonacci function:

function lazyFib(n) {  if (n == 0 || n == 1) {    return Promise.of(1);  } else {    return Promise.deferLazy(function () {      return Promise          .join(lazyFib(n - 2), lazyFib(n - 1))          .thenApply(function (a, b) { return a + b;});    });  }}

I then tried running it with and without promise tracing. In chrome tracing makes it around 60% slower on average. In firefox the cost is only around 20% but then this program was a lot slower overall. I believe this is because in firefox the event queue runs at a slower rate by design. In safari you can enable tracing and it’s essentially free but in the version of safari I tried you can’t actually get a stack trace so the promise traces produced are useless. I believe stack trace support is coming to safari too though.

This example is the absolute worst case performance cost; nobody would write as promise-heavy a program as that example. To also get a sense for the impact on a more realistic program I wrote a larger benchmark, one that simulates an RPC system with a string repository that holds a large number of strings which have to be fetched asynchronously, a few at a time, and then processed by a chain of other asynchronous operations. It was still ridiculously asynchronous, much more than any realistic program, but at least not quite as trivial. This time the performance cost in chrome was 0.8% on average. Statistically significant but tiny.

## Conclusion

Debugging asynchronous JavaScript programs is difficult. However, adding support to libraries that implement asynchronous abstractions like promises is straightforward and the overhead of collecting traces is likely to be insignificant for any realistic JavaScript program.