Compressing multiple messages

How many bits of data does it take to encode an \(n\)-message? Naively, the answer is \(\lceil \log_2(n) \rceil\) (why?): For example, it takes 5 bits to encode a 21-message, because 4 bits are only enough to encode 16 different messages, but 5 bits are enough to encode 32. The use of the ceiling function implies an inefficiency: 2 bits are required to encode a 3-message, but 2 bits are enough to distinguish between four different possibilities, so one of those possibilities is wasted. That inefficiency can be reduced by encoding multiple \(n\)-messages at the same time. For example, while an individual 3-message requires 2 bits to encode, a series of ten 3-messages requires at most 16 bits to encode: \(3^{10} < 2^{16}.\)
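
For instance, a quick Python sketch (the helper name here is our own) checks this arithmetic: the per-message cost drops from 2 bits to 1.6 bits when the ten 3-messages are encoded as one block.

```python
import math

def bits_for_messages(n: int) -> int:
    """Fixed-length bits needed to distinguish n equally likely messages."""
    return math.ceil(math.log2(n))

print(bits_for_messages(3))             # 2 bits for a single 3-message
print(3 ** 10 < 2 ** 16)                # True: 59049 < 65536
print(bits_for_messages(3 ** 10))       # 16 bits for ten 3-messages together
print(bits_for_messages(3 ** 10) / 10)  # 1.6 bits per 3-message
```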

Why is it that encoding ten 3-messages together (using bits) is cheaper than encoding ten 3-messages separately? Naively, there are three different factors that allow the combined encoding to be shorter than the sum of the separate encodings: The messages could have different likelihoods (allowing the combined message to be compressed in expectation); the messages could be dependent on each other (meaning they can be compressed); and the mismatch between bits and 3-messages gets washed out as we put more 3-messages together (see How many bits to a trit?).
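
The third factor can be checked numerically: with all messages equally likely, the fixed-length cost per 3-message falls toward \(\log_2(3) \approx 1.58\) bits as more 3-messages are encoded in one block. A small sketch (Python; the exact-arithmetic helper is our own):

```python
import math

def bits_for_messages(n: int) -> int:
    """ceil(log2(n)), computed exactly as (n - 1).bit_length() for n >= 2."""
    return (n - 1).bit_length()

# Cost per 3-message when k of them are encoded as a single block.
for k in (1, 2, 5, 10, 100, 1000):
    total = bits_for_messages(3 ** k)
    print(f"k={k:5d}: {total} bits total, {total / k:.4f} bits per 3-message")

print(f"limit: log2(3) = {math.log2(3):.4f}")
```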

In fact, the first two factors are equivalent: ten 3-messages are equivalent to one \(3^{10}\)-message, and in general, \(k\) \(n\)-messages are equivalent to one \(n^k\)-message. If the individual \(n\)-messages are dependent on each other, then different \(n^k\)-messages have different likelihoods: For example, if message 3 never follows message 2, then in the combined message, “23” never appears as a substring.
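
Both points can be made concrete with a short Python sketch (helper names are our own; messages are labelled 0 through \(n-1\) here, so the text's "3 never follows 2" becomes "the pair (1, 2) never occurs"): packing \(k\) base-\(n\) digits into a single number below \(n^k\), and counting how a dependency between consecutive messages shrinks the set of possible combined messages.

```python
import itertools
import math

def pack(messages, n):
    """Read a sequence of k n-messages (numbered 0..n-1) as the base-n digits
    of a single number, i.e. as one n**k-message."""
    value = 0
    for m in messages:
        value = value * n + m
    return value

# Ten 3-messages correspond one-to-one with the numbers 0 .. 3**10 - 1.
packed = {pack(seq, 3) for seq in itertools.product(range(3), repeat=10)}
assert packed == set(range(3 ** 10))

# A dependency between consecutive messages rules some combined messages out:
# here, message 2 never follows message 1, so the pair (1, 2) never occurs.
allowed = [
    seq for seq in itertools.product(range(3), repeat=10)
    if all(pair != (1, 2) for pair in zip(seq, seq[1:]))
]
print(len(allowed), "of", 3 ** 10, "combined messages remain possible")
print("fixed-length cost:", math.ceil(math.log2(len(allowed))),
      "bits instead of", math.ceil(math.log2(3 ** 10)))
```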

Thus, there are two different ways that an encoding of \(k\) \(n\)-messages can be shorter than \(k\) times the encoding of an \(n\)-message: The various combined messages can have different likelihoods, and the coding itself can become more efficient as messages are grouped together (the wasted possibilities from the ceiling function get washed out). To study the effect of different likelihoods on the encoding length in isolation, we can assume that the codings are maximally efficient and see how much additional compression the different likelihoods get us. To study code efficiency in isolation, we can assume each message is equally likely and see how much additional compression we get as we put more \(n\)-messages together. In practice, real compression involves using both techniques at once.
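
The first effect can be illustrated on its own with a made-up example (the probabilities and the prefix code below are invented for illustration): when the possible messages are not equally likely, a variable-length code beats the 2-bit fixed-length code for a 3-message in expectation.

```python
# Invented probabilities for the three possible messages.
probs = {"A": 0.5, "B": 0.25, "C": 0.25}

# A prefix-free code: no codeword is a prefix of another, so a stream of
# codewords can be decoded unambiguously.
code = {"A": "0", "B": "10", "C": "11"}

expected_bits = sum(probs[m] * len(code[m]) for m in probs)
print(expected_bits)  # 1.5 bits per message on average, versus 2 bits fixed-length
```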