Double

Many (most?) virtual machines for dynamically typed languages use tagged integers. On a 32-bit system the “natural” tagsize for a value is 2 bits since objects are (or can easily be) 4-byte aligned. This gives you the ability to distinguish between four different types of values each with 30 bits of data, or three different types where two have 30 bits and one has 31 bits of data. V8 uses the latter model: 31 bit integers, 30 bit object pointers and 30 bit “failure” pointers used for internal bookkeeping. On 64-bit systems the natural tagsize is 3 which allows for more types of tagged values, and inspired by this I’ve been playing around with tagging another kind of value: double precision floating-point numbers.

Whenever you want to tag a value you need to find space in the value for the tag. With objects it’s easy: they’re already aligned so the lowest few bits are automatically free:

oooooooooooooo..oooooooooooooo000 (untagged)
oooooooooooooo..oooooooooooooottt (tagged)

With integers no bits are free automatically so you only tag if the highest bits are the same, by shifting the lowest bits up and adding the tag in the lowest bits, for instance:

0000iiiiiiiiiiiii..iiiiiiiiiiiiii (untagged)
0iiiiiiiiiiiii..iiiiiiiiiiiiiittt (tagged)

The reason both the highest bits 0000 and 1111 work is that when you shift up by 3 there will be one bit of the original 4 left, and when shifting down again sign extension spreads that bit back out over all the top 4 bits.

With doubles it becomes harder to find free bits. A double is represented as one sign bit s, an 11-bit exponent e and a 52 bit fraction, f:

seeeeeeeeeeeffffffffff..fffffffff (untagged)

The value of the number is (-1)s2e1.f.

To get a feel for how this representation works I tried decomposing some different values into into their parts, here using single precision but the same pattern applies to double precision:

value sign exponent fraction
1.0 0 0 1
-1.0 1 0 1
2.0 0 1 1
0.5 0 -1 1
3.0 0 1 1.5
100.0 0 6 1.5625
10000.0 0 13 1.2207031
-10000.0 1 13 1.2207031
0.0001 0 -14 1.6384
0 0 -127 0.0 (denormalized)
NaN 0 128 1.5
Infinity 0 128 1.0

Looking at this table it’s clear that in an interval around ±1.0, starting close to 0.0 on the one side and stretching far out into the large numbers, the exponent stays fairly small. There are 11 bits available to the exponent, that’s a range from -1022 to 1023 since one exponent is used for special values, but for all values between 0.0001 and 10000 for instance its actual value only runs from -14 to 13.

Even though the high bits of the exponent are unused for many numbers you can’t just grab them for the tag. The 11 exponent bits don’t contain the exponent directly but its value plus a bias of 1023, apparently to make comparison of doubles easier. But after subtracting the bias from the exponent it is indeed just a matter of stealing its top three bits of the value (leaving the sign bit in place):

su000uuuuuuuffffffffff..fffffffff (untagged, bias subtracted)
suuuuuuuuffffffffff..fffffffffttt (tagged, bias subtracted)

Using this approach all doubles whose exponent is between -127 and 128 can be tagged. Since single precision numbers use 8 bits for the exponent all numbers between the numerically greatest and numerically smallest single precision numbers, positive and negative, can be represented as tagged doubles. One potential concern is that this encoding does not allow ±0.0, NaNs or ±∞ but I’m not too worried about that; it’s easy to handle 0.0 specially and it’s unclear how problematic the other values are.

What’s the performance of this? The short answer is: I don’t know, haven’t tried it. The long answer is: I can guess and I’m fairly certain that it’s worth it.

One reason it might not be worth it is if most doubles are not covered by this. To test this I instrumented v8 to test each time a heap number was allocated whether the value would fit as a tagged double. Furthermore I disabled some optimizations that avoid number allocation to make sure the allocation routines saw as many numbers as possible. The results after running the v8 benchmarks were encouraging:

+----------------------------------------+-------------+
| Name | Value |
+----------------------------------------+-------------+
| c:SmallDoubleZero | 16149 |
| c:SmallDoubleNaN | 2 |
| c:SmallDoubleInfinity | 32959 |
| c:SmallDoubleOther | 1290 |
| c:SmallDoubleHits | 628660 |

The number in c:SmallDoubleHits is the ones that fit in a tagged double: 92%. Furthermore, of the numbers that could not be represented all but 2% were zero or infinity. The v8 benchmark suite contains two floating-point heavy benchmarks: a cryptography benchmark and a raytracer. Interestingly, the cryptography benchmarks is responsible for almost all the ∞s and the raytracer is responsible for almost all the 0.0s. If a special case was added for 0.0 then 99.3% of all the doubles used in the raytracer would not require heap allocation.

Note: people often ask if we plan to add 64-bit support to v8. Don’t read this as an indication one way or the other. I’m just testing this using v8 because that’s the code base and benchmarks I know best.

Note that this is just a single data point, there’s no basis to conclude that the same holds across most applications. Note also that since v8 uses tagged integers there is some overlap between this an existing optimizations. But since all 32-bit integers can be represented as tagged doubles any nonzero numbers that are not counted because they were recognized as small integers would have counted as tagged doubles.

Another reason it might not be worth it could be that tagging and untagging is expensive; I doubt that it is considering how relatively few operations it takes:

static const uint64_t kBias = 0x3800000000000000ULL;
static const uint64_t kStolenBitsMask = 0x7000000000000000ULL;
static const uint64_t kLowerMask = 0xfffffffffffffffULL;
static const uint64_t kUpperMask = 0x8000000000000000ULL;
static const uint64_t kTagSize = 3;
static const uint64_t kTag = 0x3;

uint64_t double_bits(double value) {
return *reinterpret_cast(&value);
}

double make_double(uint64_t value) {
return *reinterpret_cast<double*>(&value);
}

bool fits_small_double(double value) {
return ((double_bits(value) - kBias) & kStolenBitsMask) == 0;
}

uint64_t tag_small_double(double value) {
uint64_t bits = double_bits(value) - kBias;
return (bits & kUpperMask)
| ((bits & kLowerMask) << kTagSize)
| kTag;
}

double untag_small_double(uint64_t bits) {
uint64_t result = (bits & kUpperMask)
| ((bits >> kTagSize) & kLowerMask);
return make_double(result + kBias);
}

I haven’t benchmarked this so I can only guess about the performance. I have no doubt that testing and tagging a double is faster overall than allocating a heap object but I suspect reading from memory could turn out to be cheaper than untagging. The overall performance also depends on how doubles are used: if most are only used briefly and then discarded then the possible cheaper reads don’t make up for the more expensive allocation. But I’m just guessing here.

In any case I think this is a promising strawman and it would be interesting to see how it performed in a real application. I also suspect there are more efficient ways of tagging/untagging the same class of doubles. The problem is actually a fairly simple one: reduce the highest bits from 01000, 11000, 01111 or 11111 to only two bits and be able to go back again to the same five bits.

Further reading: alternative approaches are here, which is essentially the same as this but has more special cases, and this which goes the other way and stores other values in the payload of NaN values.

One Response to Double

  1. Raffaello Giulietti

    The Gemstone/S 64 Bit Smalltalk Object Server has implemented a similar optimization: there are 2 concrete classes for doubles, one for “immediate” values with 8 bits exponents and one for heap allocated objects with the full 11 bits exponent range. Of course, for most usages this is completely transparent.

Leave a Reply

Your email address will not be published. Required fields are marked *


*