Varfloats
My last post was about LPV256, a varint format that can hold values from 7 up to 2048 bits. Here is the format in a nutshell,
7 bits: 0xxxxxxx
14 bits: 10xxxxxx xxxxxxxx
21 bits: 110xxxxx xxxxxxxx xxxxxxxx
28 bits: 1110xxxx xxxxxxxx xxxxxxxx xxxxxxxx
35 bits: 11110xxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx
64 bits: 11111000 + 8 b
128 bits: 11111001 + 16 b
256 bits: 11111010 + 32 b
512 bits: 11111011 + 64 b
1024 bits: 11111100 + 128 b
2048 bits: 11111101 + 256 b
unused: 1111111x
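As a refresher, the first byte alone determines the total encoded size. Here is a minimal C sketch of that computation under the layout above; the function name is mine, not from the original post:

```c
#include <stdint.h>

// Total encoded size in bytes implied by the first byte of an LPV256
// value, per the layout above. A sketch only; 1111111x is unused.
static int lpv256_size(uint8_t first) {
  int ones = 0;                    // leading one bits pick the bucket
  while (ones < 8 && (first & (0x80 >> ones)))
    ones++;
  if (ones <= 4)
    return ones + 1;               // 7..35 bit forms: prefix byte + ones bytes
  return 1 + (8 << (first & 7));   // 11111xxx forms: prefix + 8<<x bytes
}
```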
An alternative view of the format though is that it can be a container for any data of variable size. It doesn’t have to just hold integers. For instance, another thing you might want to store in a variable-size format is floating-point values.
Floating-point values are trickier to store with variable size. With integers it’s common that values happen to be small and you will often be able to automatically get some benefit from a variable-size integer encoding. With floating-point values though you will rarely get small values by accident, whatever your definition of “small” is. If you’re doing a single or double precision computation you’re likely to end up with a value that requires full single or double precision to represent.
Even a constant like `0.1d` that looks small or simple isn’t. The real value isn’t the mathematical value 0.1 but the closest double-precision approximation, in this case 0.100000000000000006. That’s tricky to store in a smaller representation than the full 64 bits. There are values that both look and are small, like `0.0`, `0.5`, and the integers. But you’re less likely to run into those by accident.
For this reason the benefit of varfloats is different from that of varints. You’re unlikely to automatically save space. If you want a `0.1` that doesn’t take up much space you have to explicitly discard precision to make it small. You might use single precision `0.1f` instead of double precision `0.1d`. That gives you a different approximation, one that fits into 32 bits, in this case 0.10000000149. Or you might go even lower, to half or 8-bit precision. In this case the benefit of varfloats is that if you explicitly ensure that your value fits in a smaller representation then the encoding will detect that and use less space. You don’t have to explicitly specify what size you want, or have all floats take up 32 or 64 bits. It saves your data format from having to make awkward tradeoffs or assumptions about how to store floats.
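To make that concrete, here is a small C example (my own, not from any implementation discussed here) showing that `0.1` does not survive a round trip through single precision while `0.5` does:

```c
#include <stdio.h>

int main(void) {
  double d = 0.1;                    // really 0.100000000000000006
  float f = (float)d;                // explicitly discard precision
  printf("%d\n", (double)f == d);    // 0: 0.1 needs full double precision
  printf("%.17g\n", (double)f);      // 0.10000000149011612, the float version
  double h = 0.5;                    // exactly representable at any precision
  printf("%d\n", (double)(float)h == h);  // 1: this one really is "small"
  return 0;
}
```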
This also says something about what kinds of values should be small: they should be the ones you get by discarding precision. IEEE-754 defines precisions down to 16 bits, and there are 8-bit precision values in practical use¹.
minifloat (8): seeeemmm
half (16): seeeeemm mmmmmmmm
single (32): seeeeeee emmmmmmm mmmmmmmm mmmmmmmm
double (64): seeeeeee eeeemmmm mmmmmmmm mmmmmmmm mmmmmmmm mmmmmmmm mmmmmmmm mmmmmmmm
These are the formats people are likely to produce, so those are the ones we’d like to be able to store in less space. Since LPV256 has fixed-size buckets we can’t just fit in those sizes though. There is no 8- or 16-bit encoding, only 7, 14, 21 bits, and so on. So some tweaking is required. But at a minimum it should be the case that:
- The 7-bit encoding should hold values from minifloat.
- Minifloat values that don’t fit within 7 bits should fit in 14.
- The 14-bit encoding should hold half precision values.
- Half precision values that don’t fit within 14 bits should fit in 21.
- … and so on up to 32 and 64 bit values.
The construction
One approach would be to use the fact that while IEEE-754 only defines a few precisions, you can define precisions of any size. You can define a float as small as 4 bits if you want and, more to the point, a 7-, 14-, or 21-bit format. You just have to decide how many of the bits go to the exponent and how many to the mantissa. Here is one way to define them:
7 bits: seeemmm
14 bits: seeeemm mmmmmmm
21 bits: seeeeem mmmmmmm mmmmmmm
28 bits: seeeeee emmmmmm mmmmmmm mmmmmmm
32 bits: seeeeeee emmmmmmm mmmmmmmm mmmmmmmm
64 bits: seeeeeee eeeemmmm mmmmmmmm mmmmmmmm mmmmmmmm mmmmmmmm mmmmmmmm mmmmmmmm
This way you can define a varfloat format like this (there’s a C sketch of the loop right after the list).
- Take your input, say a double precision value.
- Convert it to a 7-bit float. Compare the result to the original input. If they represent the same value then that is the result.
- Otherwise, convert it to a 14-bit float. If the result represents the same value as the input that is your result.
- …and so on for 21, 28, and 32 bits.
- If none of the previous steps have succeeded the input requires full 64 bits.
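Here is a minimal C sketch of that loop, using the exponent/mantissa splits defined above. The helper name `round_trips` is mine, and the sketch ignores denormals entirely (more on that below):

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

// Does v survive a round trip through a float format with the given
// exponent and mantissa widths? It does when its exponent is in the
// format's normal range and its mantissa needs at most man_bits bits.
static bool round_trips(double v, int exp_bits, int man_bits) {
  uint64_t bits;
  memcpy(&bits, &v, sizeof bits);
  int raw_exp = (int)((bits >> 52) & 0x7ff);     // 11-bit biased exponent
  uint64_t man = bits & 0xfffffffffffffull;      // 52-bit mantissa
  uint64_t low = (1ull << (52 - man_bits)) - 1;  // bits that must be zero
  if (raw_exp == 0) return man == 0;             // only +-0.0, no denormals
  if (raw_exp == 0x7ff) return (man & low) == 0; // infinity and NaN
  int bias = (1 << (exp_bits - 1)) - 1;          // e.g. 3 for 3 exponent bits
  int e = raw_exp - 1023;                        // unbiased exponent
  return e >= 1 - bias && e <= bias && (man & low) == 0;
}

// The size-selection loop: try each format from smallest to largest.
static int varfloat_bits(double v) {
  if (round_trips(v, 3, 3))  return 7;   // seeemmm
  if (round_trips(v, 4, 9))  return 14;  // seeeemm mmmmmmm
  if (round_trips(v, 5, 15)) return 21;
  if (round_trips(v, 7, 20)) return 28;
  if (round_trips(v, 8, 23)) return 32;
  return 64;
}
```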
Converting between different precisions is a well-defined operation (well, not quite but we’ll get to NaNs in a moment) so this encoding is well-defined too. However, it’s not obvious that it can be done efficiently. And there are devils in the details. But just to get a sense for how this works in practice here is a table of all the values that will fit in a single varfloat byte,
| sign | exp | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 000 | 0 | *0.03125* | *0.0625* | *0.09375* | *0.125* | *0.15625* | *0.1875* | *0.21875* |
| 0 | 001 | 0.25 | 0.28125 | 0.3125 | 0.34375 | 0.375 | 0.40625 | 0.4375 | 0.46875 |
| 0 | 010 | 0.5 | 0.5625 | 0.625 | 0.6875 | 0.75 | 0.8125 | 0.875 | 0.9375 |
| 0 | 011 | 1 | 1.125 | 1.25 | 1.375 | 1.5 | 1.625 | 1.75 | 1.875 |
| 0 | 100 | 2 | 2.25 | 2.5 | 2.75 | 3 | 3.25 | 3.5 | 3.75 |
| 0 | 101 | 4 | 4.5 | 5 | 5.5 | 6 | 6.5 | 7 | 7.5 |
| 0 | 110 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
| 0 | 111 | ∞ | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 000 | -0 | *-0.03125* | *-0.0625* | *-0.09375* | *-0.125* | *-0.15625* | *-0.1875* | *-0.21875* |
| 1 | 001 | -0.25 | -0.28125 | -0.3125 | -0.34375 | -0.375 | -0.40625 | -0.4375 | -0.46875 |
| 1 | 010 | -0.5 | -0.5625 | -0.625 | -0.6875 | -0.75 | -0.8125 | -0.875 | -0.9375 |
| 1 | 011 | -1 | -1.125 | -1.25 | -1.375 | -1.5 | -1.625 | -1.75 | -1.875 |
| 1 | 100 | -2 | -2.25 | -2.5 | -2.75 | -3 | -3.25 | -3.5 | -3.75 |
| 1 | 101 | -4 | -4.5 | -5 | -5.5 | -6 | -6.5 | -7 | -7.5 |
| 1 | 110 | -8 | -9 | -10 | -11 | -12 | -13 | -14 | -15 |
| 1 | 111 | -∞ | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
It’s not a huge set and not one you’re likely to hit by accident. But it does include common values like `0`, `1`, `-1`, `0.5`, ∞, -∞, etc. If you were going to pick a set of values that definitely should fit within the smallest varfloat I think it has a lot of them.
Conversion
Of course this only makes sense if we can do the conversion efficiently, and we can. To get a sense for how, consider what a single-precision number that can be stored as a 7-bit float looks like:
| v | 7 bit | 32 bit |
|---|---|---|
| 0 | 0 000 000 | 0 00000000 00000000000000000000000 |
| 0.25 | 0 001 000 | 0 01111101 00000000000000000000000 |
| 0.5 | 0 010 000 | 0 01111110 00000000000000000000000 |
| 1 | 0 011 000 | 0 01111111 00000000000000000000000 |
| 2 | 0 100 000 | 0 10000000 00000000000000000000000 |
| 4 | 0 101 000 | 0 10000001 00000000000000000000000 |
| 8 | 0 110 000 | 0 10000010 00000000000000000000000 |
| ∞ | 0 111 000 | 0 11111111 00000000000000000000000 |
We’re going to treat `0` and ∞ specially, so the pattern that identifies single-precision values that fit in 7 bits is that they have either of these two forms:
s011111eemmm00000000000000000000
s100000eemmm00000000000000000000
Similarly, 32-bit floats that can be stored in 14 bits have either of these two forms,
s01111eeemmmmmmmmm00000000000000
s10000eeemmmmmmmmm00000000000000
And so on. For each size there is a particular set of exponents that will fit, and then a particular size of mantissa. There are only 256 possible exponents so we can simply make a table from those to the smallest format that will hold that exponent. Similarly we can count the number of trailing zeroes in the mantissa, which tells us how many significant mantissa bits remain, and map that to the smallest format that will hold that mantissa. The max of those two values will be the smallest format that will hold the value.
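Sketched in C, with made-up table names, that classification could look like this (assuming `init_tables` runs once up front):

```c
#include <stdint.h>
#include <string.h>

enum { F7, F14, F21, F28, F32, NFORMATS };
static const int exp_bits[NFORMATS] = { 3, 4, 5, 7, 8 };
static const int man_bits[NFORMATS] = { 3, 9, 15, 20, 23 };

static uint8_t exp_format[256]; // raw single exponent -> smallest format
static uint8_t man_format[24];  // significant mantissa bits -> smallest format

static void init_tables(void) {
  for (int e = 0; e < 256; e++) {
    exp_format[e] = F32;  // exponent 0 is handled separately (see below)
    for (int f = NFORMATS - 1; f >= 0; f--) {
      int bias = (1 << (exp_bits[f] - 1)) - 1;
      int u = e - 127;    // unbiased exponent
      // All-ones exponents (infinity/NaN) fit every format.
      if (e == 255 || (u >= 1 - bias && u <= bias))
        exp_format[e] = (uint8_t)f;
    }
  }
  for (int n = 0; n <= 23; n++) {
    man_format[n] = F32;
    for (int f = NFORMATS - 1; f >= 0; f--)
      if (n <= man_bits[f]) man_format[n] = (uint8_t)f;
  }
}

// Smallest format that holds v: the max of the two table lookups.
static int classify(float v) {
  uint32_t bits;
  memcpy(&bits, &v, sizeof bits);
  uint32_t exp = (bits >> 23) & 0xff;
  uint32_t man = bits & 0x7fffff;
  int sig = man ? 23 - __builtin_ctz(man) : 0; // significant mantissa bits
  int a = exp_format[exp], b = man_format[sig];
  return a > b ? a : b;  // the widest requirement wins
}
```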
Let’s take an example. Say we’re given this 32-bit float (which represents 3.1415405),
01000000010010010000111100000000
First we take the exponent, 10000000 (or 128), and look it up in the exponent-to-format table. That tells us that a 7-bit varfloat can hold values with that exponent. Then we take the mantissa, 10010010000111100000000, and count its trailing zeroes: 8, leaving 23 − 8 = 15 significant bits. We look that up in the mantissa-bits-to-format table, which tells us that it requires a 21-bit varfloat. The widest format required is 21 bits so that’s what we’ll use. The result is
010000100100100001111
The encoding step itself is just discarding the middle bits of the exponent (which are all the same and the negation of the top bit) as well as the bottom bits of the mantissa (which are all 0). You’ll notice that the entire process only has one branch at the end, when we dispatch on the computed format (you can make the max branchless).
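For concreteness, here is a C sketch of that bit-discarding step for the 21-bit case (5 exponent bits, 15 mantissa bits); the function name is mine:

```c
#include <stdint.h>

// Pack a single-precision bit pattern known to fit in 21 bits.
// Keeps the sign, the exponent's top and bottom four bits (the three
// middle bits are redundant), and the mantissa's top 15 bits.
static uint32_t pack21(uint32_t bits) {
  uint32_t sign = bits >> 31;
  uint32_t exp  = (bits >> 23) & 0xff;               // 8 exponent bits
  uint32_t man  = bits & 0x7fffff;                   // 23 mantissa bits
  uint32_t e5   = ((exp >> 3) & 0x10) | (exp & 0xf); // top bit + low four
  return (sign << 20) | (e5 << 15) | (man >> 8);     // s eeeee m*15
}
```

Running this on the example above (0x40490F00) gives exactly the 21 bits shown.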
To decode the value we extend the exponent immediately below the top bit with copies of the negation of the top bit, and we extend the mantissa with least significant 0 bits,
01[000]0000100100100001111[00000000]
The bracketed bits are the ones added. Similarly, decoding `0.5` from a 7-bit value (`0 010 000`) to 32 bits gives you
00111111000000000000000000000000
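And a matching C sketch of the 21-bit decode, again with names of my own choosing. The all-ones exponent gets its own case, which connects to the special handling of infinity and NaN discussed below:

```c
#include <stdint.h>

// Unpack a 21-bit varfloat (s eeeee m*15) back to a single-precision
// bit pattern: re-insert three copies of the negated top exponent bit
// and pad the mantissa with trailing zeros.
static uint32_t unpack21(uint32_t v) {
  uint32_t sign = (v >> 20) & 1;
  uint32_t e5   = (v >> 15) & 0x1f;
  uint32_t man  = v & 0x7fff;        // 15 mantissa bits
  uint32_t e8;
  if (e5 == 0x1f) {                  // infinity and NaN: all-ones exponent
    e8 = 0xff;
  } else {                           // normals; zero never reaches here
    uint32_t top = e5 >> 4;          // since it is always encoded in 7 bits
    e8 = (top << 7) | ((top ^ 1u) * 0x70) | (e5 & 0xf);
  }
  return (sign << 31) | (e8 << 23) | (man << 8);
}
```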
This works on single-precision values, but earlier I talked about the input being double precision. That’s because I assume a first step where we recognize double-precision values that fit within single precision, and also recognize the exponent for `0.0` and treat it specially. So it’s not like this encoding is free, there’s a bit of branching and nontrivial work, but not all that much.
Why do we need to treat `0.0` differently? That’s because of denormals. The pattern we use above, where we can just look at the exponent and mantissa separately, doesn’t work for an exponent that’s all zero. There we need 7 bits for one particular value, `0.0` itself, when the mantissa is also all zero. And for nonzero values of the mantissa we need a full 64 bits because those are denormals and they won’t fit with less precision.
This also means that what I’ve described here doesn’t produce encoded denormals. I include denormal values like `0.125` in the table of 7-bit values above, but if you run the encoding process as I’ve described it here they won’t actually be stored like that. Instead they’ll end up as 14-bit normal values. That’s why those values are in italics in the table. You can add an extra step to the 14-bit encoding to recognize values that will actually fit as 7-bit denormals (and similarly for 21-bit and so on) but it’s not necessary for correctness so I’ll leave that to the reader.
NaNs
One more detail is NaNs. It’s not commonly used, but each NaN value has a payload stored in the mantissa. The problem is that while it’s well-defined how you convert infinities, normals, and denormals to a different precision, the IEEE-754 spec says nothing about how to convert NaNs. I’ve looked at how different languages and libraries specify floating-point conversion and haven’t found any that mention it. Finally I looked at processor instruction sets, and it turns out that in practice NaNs are narrowed the same way other values are: you discard the least significant bits. So this 7-bit NaN, `0 111 110`, becomes this single-precision value: `0 11111111 11000000000000000000000`. That means that when encoding you can use exactly the same path for infinity and NaN as for other non-zero values, which is nice. They do require special handling when decoding since you need to do sign extension slightly differently.
Endianness
The last thing to consider is: when some of the value is stored in the prefix byte, which part of the value do we store? For a 14-bit value, for instance, do we store the most significant bits in the prefix:
10seeeem mmmmmmmm
or the least significant,
10mmmmmm seeeemmm
I think it makes sense to store the least significant bits. If you look at the second byte in the second case, seeeemmm, that is just a minifloat. So you get a fast case when the prefix is 10000000 where you can just read a minifloat directly from the second byte. This also simplifies writing values that you know are minifloats. The same works for the other precisions as well:
7 bits: 0seeemmm
14 bits: 10mmmmmm seeeemmm
21 bits: 110mmmmm seeeeemm mmmmmmmm
28 bits: 1110mmmm seeeeeee mmmmmmmm mmmmmmmm
35 bits: 11110mmm seeeeeee emmmmmmm mmmmmmmm mmmmmmmm
64 bits: 11111000 seeeeeee eeeemmmm mmmmmmmm mmmmmmmm mmmmmmmm mmmmmmmm mmmmmmmm mmmmmmmm
You’ll notice in the 14-, 21-, 35-, and 64-bit cases that if the least significant bits of the mantissa stored in the prefix are all 0
the rest are the standard 8-, 16-, 32-, and 64-bit encodings:
8 bits: 10000000 seeeemmm
16 bits: 11000000 seeeeemm mmmmmmmm
32 bits: 11110000 seeeeeee emmmmmmm mmmmmmmm mmmmmmmm
64 bits: 11111000 seeeeeee eeeemmmm mmmmmmmm mmmmmmmm mmmmmmmm mmmmmmmm mmmmmmmm mmmmmmmm
As with the varints those are nice fast cases to have.
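As a sketch of why this layout is convenient, here is a minimal C read of the 14-bit case; both function names are mine:

```c
#include <stdint.h>

// Read a 14-bit varfloat (10mmmmmm seeeemmm) into its natural layout,
// s eeee mmmmmmmmm, in the low 14 bits of the result.
static uint16_t read14(const uint8_t *p) {
  uint16_t low6 = p[0] & 0x3f;  // least significant mantissa bits
  uint16_t top8 = p[1];         // seeeemmm: sign, exponent, mantissa top
  return (uint16_t)((top8 << 6) | low6);
}

// The fast case from above: when the prefix byte is exactly 10000000
// the second byte is a plain 8-bit minifloat and can be used directly.
static int try_read_minifloat(const uint8_t *p, uint8_t *out) {
  if (p[0] != 0x80) return 0;
  *out = p[1];
  return 1;
}
```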
1. There is no standard canonical 8-bit format. I’ve gone with 4 exponent bits, 3 mantissa bits, and a bias of 7.