Header Compression format for HTTP/2, known as HPACK, foresees the use of the Huffman algorithm for encoding header literal values. This contributes to the additional decrease in the quantity of data, transferred with each web request and response.
A Huffman code is a particular type of optimal prefix code that is commonly used for lossless data compression. The process of finding or using such a code proceeds by means of Huffman coding, an algorithm developed by David A. Huffman. The output from Huffman’s algorithm can be viewed as a variable-length code table for encoding a source symbol (such as a character in a file). The algorithm derives this table from the estimated probability or frequency of occurrence (weight) for each possible value of the source symbol. As in other entropy encoding methods, more common symbols are generally represented using fewer bits than less common symbols. Huffman’s method can be efficiently implemented, finding a code in time linear to the number of input weights if these weights are sorted. (Source: Wikipedia)
HPACK compression entails a pre-created canonical Huffman code table for encoding ASCII characters to the Huffman sequence. A canonical Huffman code is a particular type of Huffman code with unique properties that allow it to be described in a very compact manner. In the aforementioned table are the Huffman codes for each ASCII character with a length up to 32 bits (4x by 8 fields with value 0 or 1), in the form of base-2 integer, aligned on the most significant bit (MSB is the bit farthest to the left).
Encoding is relatively easy since we are replacing the individual characters with the Huffman code. We add theEOS sign at the end to always fill the entire octet.
The illustration shows how the encoder iterates through all the ASCII characters and replaces them with the Huffman code. Each line ends with the EOS character which serves as (up to 7 bits) padding.
While adding the Hoffman code to the sequence, the length of the added code must exactly match the number of bits specified in the documentation. Working with Huffman codes in bytes and then converting them to other types, such as strings, could remove the prepended zeros. In such cases, we have to do some plumbing to ensure all bits are there (an example of this would be the character “%”).
Implementation could be achieved by manipulating a string of ones and zeros. However, for more complex systems such as high-performance web servers, this would not be sustainable from the performance perspective. To manage resources accordingly, we require innovation so the investments are protected.
A replacement of the string with characters such as numbers, which are more appropriate for computers, and the use of bitwise operators gives a significant increase in performance. Before this can be done, we need to have an understanding of how the numbers are added. Although we are all aware of what “1+2=3” is, or what is a concatenation of a string such as “aa+bb=aabb”, in bit operations, these rules are not quite so obvious. Let’s see an example of the addition with bits directly:
For the sum of two bit numbers, we used the bitwise operator OR denoted by the “|“ symbol which serves as a sign for addition “+” in our example. Its rule is to trace the bits of both numbers and, if a 0 or a 1 is found on the same spot, change their value to 1, while setting the value to 0 in other cases. This understanding now enables us to re-implement the example above.
Instead of a string, we will use a u64 data type storing a string of 64 bits. We could also use a data type with a larger capacity (such as u128), but u64 is sufficient. The storage requirement is 32 bits, which is the maximum length of the individual Huffman code plus an extra byte (8) for the surplus cell, meaning that we need 40 bits of storage altogether.
The illustration below shows individual steps for encoding a string of characters as in the example above, while the encoding is carried out with the use of numbers and bitwise operators.
Although the illustration is quite similar to the previous one, it is much more colorful. It is also apparent that a string of bits is getting shorter on the left and longer on the right end.
When the Huffman code is added to the string for the individual character, the algorithm immediately ensures 32 free bit spaces where the next character will be added. This is achieved by the so-called shifting bits using the “<<” bitwise operator. Since we are dealing with bit numbers, we always rotate for 1 or more bytes, dependent on the required capacity, meaning for 8*N bits. It might not be obvious but it is interesting that, by rotating bits and adding the new Huffman character, we are adding numbers in the same way as we did in the simple example previously presented.
The full encoder implementation is available at the public GitHub repository.
If looked at separately, the Huffman algorithm is quite simple. But when we don’t intend to only implement it, but we are, instead, interested in the maximization of the performance and lowering of used resources, things get more complicated. The performance and quality of this solution are, therefore, comparable to the implementation in some well-known web servers.
In my next article HPACK: Huffman translation matrix I dive into the decoding process.