Bigger Better Varints
I recently found myself needing a varint encoding. There are lots of those around, but after reading about some of the gymnastics people have been doing to make LEB128 fast I decided to try to come up with something more efficient. I’ll describe that format, LPV256, below.
Just to rehash, LEB128 works by chopping a given number into 7-bit groups and then chaining them together. For instance, the integer 1234567 in binary is
100101101011010000111
Chop that into 7-bit groups like so[1],
1001011 0101101 0000111
Add a top bit that is 0 for the last group and 1 for all the others:
11001011 10101101 00000111
(I’ll call the top bit the continuation bit since 1 means “there are more blocks coming” and 0 means “this is the last block”.)
Using this pattern you can encode any integer value and the smaller the value the shorter the encoded representation. Neat.
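To make the scheme concrete, here is a minimal LEB128 encoder/decoder sketch in Python (the function names are my own). Note that, as footnote 1 mentions, real LEB128 emits the least significant group first, while the prose above shows the groups most-significant-first for readability:

```python
def leb128_encode(n: int) -> bytes:
    # Emit 7-bit groups, least significant group first.
    out = bytearray()
    while True:
        group = n & 0x7F
        n >>= 7
        if n:
            out.append(group | 0x80)  # continuation bit set: more groups follow
        else:
            out.append(group)         # top bit 0: this is the last group
            return bytes(out)

def leb128_decode(buf: bytes) -> int:
    result, shift = 0, 0
    for b in buf:
        result |= (b & 0x7F) << shift
        shift += 7
        if not (b & 0x80):            # one branch per byte: this is the problem
            break
    return result
```

So 1234567 encodes to the three bytes `0x87 0xAD 0x4B`, the same groups as above but in little-endian order.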
The problem is that to decode this you need to check the continuation bit for each block separately[2]. A straightforward implementation would need at least 3 branches to decode a 3-byte value. That’s not great for performance. One common approach to try to fix this is to shuffle the continuation bits around. If you’re okay with only encoding 32-bit values and reading them 4 at a time for instance you can take
123456789 = 111010110111100110100010101
3456789 = 1101001011111100010101
56789 = 1101110111010101
89 = 1011001
Chop them into groups of 8 bits,
123456789 = 00000111 01011011 11001101 00010101
3456789 = 00110100 10111111 00010101
56789 = 11011101 11010101
89 = 01011001
Prefix each one with the number of bytes, encoded so that 00 means 1 byte, 01 means 2, and so on:
123456789 = 11 00000111 01011011 11001101 00010101
3456789 = 10 00110100 10111111 00010101
56789 = 01 11011101 11010101
89 = 00 01011001
Finally, shuffle all the prefixes together into the first byte, followed by the encoded numbers:
11100100 00000111 01011011 11001101
00010101 00110100 10111111 00010101
11011101 11010101 01011001
These 11 bytes encode the 4 numbers and the first byte tells you what the layout is: first a 4-byte, then a 3-byte, then a 2-byte, and finally a 1-byte value. This means you can decode the whole thing with just a single branch on the first byte. Very efficient.
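The four-at-a-time scheme can be sketched like this; the code and names are illustrative, and I use big-endian payload bytes to match the worked example:

```python
def group_encode(values):
    # Encode exactly 4 values, each < 2**32, behind one shared prefix byte.
    assert len(values) == 4 and all(0 <= v < 2**32 for v in values)
    tag = 0
    body = bytearray()
    for v in values:
        nbytes = max(1, (v.bit_length() + 7) // 8)
        tag = (tag << 2) | (nbytes - 1)    # 2 bits per value: 00 = 1 byte .. 11 = 4 bytes
        body += v.to_bytes(nbytes, "big")  # big-endian, matching the example above
    return bytes([tag]) + bytes(body)

def group_decode(buf):
    # One lookup on the first byte tells you the whole layout.
    tag, pos, out = buf[0], 1, []
    for shift in (6, 4, 2, 0):             # the first value's width sits in the top 2 bits
        nbytes = ((tag >> shift) & 0b11) + 1
        out.append(int.from_bytes(buf[pos:pos + nbytes], "big"))
        pos += nbytes
    return out
```

Encoding the four numbers from the example produces exactly the 11 bytes shown, starting with the prefix byte 11100100.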
However, this only works if you can limit yourself to 32-bit values and if you’re reading the values sequentially. That doesn’t work for what I’m doing. I want to be able to represent very large values, for instance cryptographic hashes, GUIDs, and the like. They are large integers too, we just don’t usually store them as such. Also I want to be able to access values randomly rather than reading them sequentially. So this scheme just doesn’t work.
7z uses a model that also shuffles the continuation bits together into a prefix, but it does so for each varint rather than for 4 at a time. If we take the 4 numbers from the example before, you can think of it as using the same approach as LEB128 but with a twist. Again you chop the values into blocks of 7 bits
123456789 = 0111010 1101111 0011010 0010101
3456789 = 0000001 1010010 1111110 0010101
56789 = 0000011 0111011 1010101
89 = 1011001
Then for each you add a prefix that encodes the length in unary: 1 block is 0, 2 blocks is 10, 3 blocks is 110, 4 is 1110, and so on. So you get
123456789 = 1110 0111010 1101111 0011010 0010101
3456789 = 1110 0000001 1010010 1111110 0010101
56789 = 110 0000011 0111011 1010101
89 = 0 1011001
You’ll note that the prefix always has as many bits as the number of blocks, so the total length of the prefix plus the blocks will always be a multiple of 8. So we divide the whole value, including the prefix, into blocks of 8 instead of 7,
123456789 = 11100111 01011011 11001101 00010101
3456789 = 11100000 00110100 10111111 00010101
56789 = 11000000 11011101 11010101
89 = 01011001
Now you can tell just from looking at the first byte how long the value is: find the most significant 0 bit; the value is as many bytes long as the number of 1 bits above it plus one. So you only need one branch per value. The limitation is that when you get to 64-bit values you run out of space in the prefix byte. And I want values larger than 64 bits.
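A sketch of the single-branch decode for this 7z-style layout, using the byte order from the worked example (big-endian for readability, per footnote 1; the name is mine). It works for encoded lengths up to 8 bytes:

```python
def decode_prefix(buf: bytes) -> int:
    # Count the leading 1 bits of the first byte without a loop:
    # invert the byte, then see how far down its highest 0-turned-1 bit sits.
    ones = 8 - (buf[0] ^ 0xFF).bit_length()
    length = ones + 1                       # total encoded length in bytes
    raw = int.from_bytes(buf[:length], "big")
    # The unary prefix occupies `length` bits, leaving 7 value bits per byte.
    return raw & ((1 << (7 * length)) - 1)
```

For example the three bytes 11000000 11011101 11010101 have two leading 1 bits, so the value is 3 bytes long and decodes to 56789.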
So here is what I ended up with. I do exactly the same as 7z for values that fit within 35 bits:
7 bits: 0xxxxxxx
14 bits: 10xxxxxx xxxxxxxx
21 bits: 110xxxxx xxxxxxxx xxxxxxxx
28 bits: 1110xxxx xxxxxxxx xxxxxxxx xxxxxxxx
35 bits: 11110xxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx
Above 35 bits the scheme changes. Instead of the prefix containing the length as well as some of the value itself, the whole first byte says how long the value that follows is:
64 bits: 11111000 + 8 b
128 bits: 11111001 + 16 b
256 bits: 11111010 + 32 b
512 bits: 11111011 + 64 b
1024 bits: 11111100 + 128 b
2048 bits: 11111101 + 256 b
unused: 1111111x
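A rough decoder for this layout might look like the sketch below, assuming (as in the worked examples above, and per footnote 1’s caveat) big-endian payload bytes; the function name is illustrative, and the unused 1111111x tags are not handled:

```python
def lpv256_decode(buf: bytes) -> int:
    first = buf[0]
    ones = 8 - (first ^ 0xFF).bit_length()  # leading 1 bits of the first byte
    if ones <= 4:
        # Small forms (up to 35 bits): exactly the 7z-style scheme above.
        length = ones + 1
        raw = int.from_bytes(buf[:length], "big")
        return raw & ((1 << (7 * length)) - 1)
    # Large forms: the whole first byte is a tag, 11111000 .. 11111101,
    # and the low 3 bits select the payload size.
    nbytes = 8 << (first & 0b111)           # 8, 16, 32, 64, 128, or 256 bytes
    return int.from_bytes(buf[1:1 + nbytes], "big")
```

Either way a single check on the first byte is enough to know the full length.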
This way you can represent up to 2048-bit numbers, which is not infinite but significantly more than the estimated number of particles in the observable universe[3]. It’s plenty good enough for my use.
These schemes are usually called some opaque abbreviation so I’ll call this one LPV256: LPV for Little-endian Prefix Varint, and 256 because it can represent values up to 256 bytes long.
This tradeoff, being able to decode values with a single branch and also being able to represent very large values, isn’t free. It means that some values take up more space than if we were using LEB128. To get a sense for how much, here is a graph of how many bytes it takes to represent an n-bit number in both representations, up to 64 bits. (Smaller is better.)
Up to 35 bits it’s the same, then there’s a range where LEB128 is smaller or the same, and then LPV256 is slightly smaller for full 64-bit values.
This might look like a bad tradeoff at first glance: from 36 to 56 bits LEB128 is smaller, and only from 57 to 64 bits is LPV256 as small or smaller. That makes it look like LEB128 is smaller around 70% of the time. But that’s misleading. Each increase in bit count doubles the number of values, so in reality, if you pick a random 64-bit value there is a less than 1% chance that it lands within the range where LEB128 is smaller. And there’s a 50% chance that it lands within the range where LPV256 is strictly smaller.
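The size comparison can be reproduced with a couple of small helper functions (names mine):

```python
def leb128_size(bits: int) -> int:
    # One byte per 7-bit group: ceil(bits / 7).
    return max(1, -(-bits // 7))

def lpv256_size(bits: int) -> int:
    if bits <= 35:
        return max(1, -(-bits // 7))  # small forms: the prefix shares bytes with the value
    payload = 64
    while payload < bits:
        payload *= 2                  # payloads come in 64, 128, ..., 2048 bits
    return 1 + payload // 8           # one tag byte plus the payload

# LEB128 only wins for 36..56-bit values; everywhere else LPV256 ties or wins.
assert [b for b in range(1, 65) if leb128_size(b) < lpv256_size(b)] == list(range(36, 57))
```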
This is only looking at values up to 64 bits. If we go up to 256 bits the pattern is similar,
This isn’t that meaningful a comparison, though, because you wouldn’t store values that large as LEB128. Here’s why: consider the number of branches it takes to decode values.
To decode a 256-bit value encoded as LEB128 would require 37 branches and 37 steps of arithmetic to build the result. And until you read the 37th byte you don’t know that what you’re building is a 256-bit value, which makes building the result even more expensive. With LPV256 you know when you read the first byte that it’s a 256-bit value and it takes no work to construct the result: it can be read directly from the 32 bytes following the prefix.
Big or little prefix?
One thing that’s worth considering is: in the case where we store a bit of the value within the prefix, do we store the most or least significant bits of the value? It would seem more little-endian to store the least significant bits first but there are good reasons to be big-endian and store the most significant first. If we expect to often store 8-, 16-, and 32-bit values then those will be stored as the 14-, 21-, and 35-bit variants respectively. Here’s how an 8-bit value (let’s use 255) would be stored with a little-endian prefix:
10111111 00000011
and here it is with a big-endian prefix:
10000000 11111111
In the first case you need to do arithmetic to shift the least significant bits from the prefix into the result. In the second case the result is right there in the second byte, and you can recognize that it is just by looking at the first byte. The same is the case for 16- and 32-bit values. So it permits a fast path in some cases I expect are common.
You can think of it as extending the scheme used for bigger values, where the prefix holds no bits from the value, into the smaller ones:
8 bits: 10000000 + 1 b
16 bits: 11000000 + 2 b
24 bits: 11100000 + 3 b
32 bits: 11110000 + 4 b
64 bits: 11111000 + 8 b
…
Whether you view the prefix 10000000 as the tag for an 8-bit value or as the prefix of a 14-bit value where the top 6 bits happen to be 0 doesn’t matter. Both are equally valid views and the result is the same.
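The fast path might look like this sketch (the names and structure are my own):

```python
# Prefix bytes whose in-prefix value bits are all zero: the whole value sits
# in the payload bytes and can be read directly, with no shifting.
WHOLE_BYTE_TAGS = {0b10000000: 1, 0b11000000: 2, 0b11100000: 3, 0b11110000: 4}

def try_decode_fast(buf: bytes):
    nbytes = WHOLE_BYTE_TAGS.get(buf[0])
    if nbytes is None:
        return None  # not a fast-path encoding; fall back to the general decoder
    return int.from_bytes(buf[1:1 + nbytes], "big")
```

So the 255 example above, 10000000 11111111, is recognized from its first byte and read straight out of the second.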
Besides providing a useful fast case it is helpful for instance in patching. Say you want to write a block of data, starting with a varint that gives the size. If you know the size up front, no problem. But if you only know the size after you’ve actually written the data, then as long as you know it will fit within, say, 32 bits, you can write a placeholder 32-bit zero where the size goes,
11110000 00000000 00000000 00000000 00000000
Then once you’re done writing the data and know the length you can go back and write the size as a plain 32-bit int in the last 4 bytes without having to touch the prefix. If the low bits went in the prefix you would have to do nontrivial work to get the encoding right.
That does mean that if the size turns out to be, say, 17 it has to be legal to encode that as a 35-bit value. We can’t require that you use the smallest possible representation, which in this case would be a 7-bit value. But there’s little to be gained from requiring that anyway.
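Here’s a sketch of the patching trick, assuming the big-endian-prefix variant (the payload bytes are illustrative):

```python
buf = bytearray()
buf += bytes([0b11110000]) + bytes(4)  # placeholder: 35-bit form, value zero
start = len(buf)
buf += b"...the actual data..."        # write the data; size unknown until now
size = len(buf) - start                # must fit in 32 bits for this to work
# With a big-endian prefix the low 32 bits of the value are exactly the last
# 4 bytes of the varint, so we can patch them without touching the prefix byte.
buf[start - 4:start] = size.to_bytes(4, "big")
```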
I’ll write more about what I need this for but I thought it was an interesting bit of design in itself.
1. I’m being a little sloppy with the endianness but it should be little-endian everywhere.

2. You can be arbitrarily clever here; for instance you might shuffle the continuation bits of the next 8 bytes together into a single byte before you dispatch, so that you only dispatch once per 8 encoded bytes. That works but it has a lot of corner cases, it only works if you’re reading multiple values together, and the shuffling isn’t exactly free.

3. If we estimate that there are 10^89 particles in the visible universe, that’s roughly 2^296, so 512 bits is actually enough for there to be a unique value for each particle.