Compressing multiple messages

How many bits of data does it take to encode an \(n\)-message? Naively, the answer is \(\lceil \log_2(n) \rceil\) (why?): For example, it takes 5 bits to encode a 21-message, because 4 bits are only enough to encode 16 different messages, but 5 bits are enough to encode 32. The ceiling function implies an inefficiency: a 3-message requires 2 bits to encode, but 2 bits are enough to distinguish between four different possibilities, so one of those possibilities is being wasted. That inefficiency can be reduced by encoding multiple \(n\)-messages at the same time. For example, while an individual 3-message requires 2 bits to encode, a series of 10 3-messages requires at most 16 bits to encode, because \(3^{10} < 2^{16}.\)
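The arithmetic above can be checked directly. Here is a minimal sketch in Python (the function name `bits_needed` is just for illustration):

```python
from math import ceil, log2

def bits_needed(n_outcomes: int) -> int:
    """Bits required to encode one message with n_outcomes equally likely values."""
    return ceil(log2(n_outcomes))

# One 21-message needs 5 bits, since 2**4 = 16 < 21 <= 32 = 2**5.
print(bits_needed(21))       # 5
# One 3-message needs 2 bits, but ten 3-messages fit in 16 bits: 3**10 < 2**16.
print(bits_needed(3))        # 2
print(bits_needed(3 ** 10))  # 16
```

Encoding the ten 3-messages together costs 16 bits rather than the 20 bits of ten separate 2-bit encodings.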

Why is it that encoding ten 3-messages together (using bits) is cheaper than encoding ten 3-messages separately? Naively, there are three different factors that allow the combined encoding to be shorter than the sum of the separate encodings: The messages could have different likelihoods (allowing the combined message to be compressed in expectation); the messages could be dependent on each other (meaning they can be compressed); and the mismatch between bits and 3-messages gets washed out as we put more 3-messages together (see How many bits to a trit?).
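The third factor, the washing-out of the bit/trit mismatch, can be seen numerically. A small sketch (the helper `bits_per_message` is an illustrative name, not standard terminology):

```python
from math import ceil, log2

def bits_per_message(n: int, k: int) -> float:
    """Bits per n-message when k equally likely n-messages are encoded as one block."""
    return ceil(k * log2(n)) / k

# The per-message cost falls from 2 bits toward log2(3) ~ 1.585 as the block grows.
for k in (1, 2, 5, 10, 100):
    print(k, bits_per_message(3, k))
```

With blocks of 100 3-messages the cost is already 1.59 bits per message, close to the \(\log_2(3) \approx 1.585\) floor.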

In fact, the first two factors are equivalent: 10 3-messages are equivalent to one \(3^{10}\)-message, and in general, \(k\) \(n\)-messages are equivalent to one \(n^k\)-message. If the individual \(n\)-messages are dependent on each other, then different \(n^k\)-messages have different likelihoods: For example, if message 3 never follows message 2, then in the combined message, “32” never appears as a substring.
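The equivalence of \(k\) \(n\)-messages and one \(n^k\)-message is just positional (base-\(n\)) notation. A sketch, using 0-indexed messages for convenience (the names `combine` and `split` are illustrative):

```python
def combine(messages, n):
    """Pack k n-messages (each in range(n)) into a single n**k-message."""
    index = 0
    for m in messages:
        index = index * n + m  # treat the sequence as digits of a base-n number
    return index

def split(index, n, k):
    """Inverse: recover the k original n-messages from the combined index."""
    out = []
    for _ in range(k):
        out.append(index % n)
        index //= n
    return out[::-1]

msgs = [2, 0, 1, 2, 2, 0, 1, 1, 0, 2]
packed = combine(msgs, 3)
print(packed, 0 <= packed < 3 ** 10)  # one 3**10-message
print(split(packed, 3, 10) == msgs)
```

If some pairs of adjacent messages never occur, the corresponding combined indices have probability zero, which is exactly a non-uniform likelihood over \(n^k\)-messages.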

Thus, there are two different ways that an encoding of \(k\) \(n\)-messages can be shorter than \(k\) times the encoding of an \(n\)-message: The various combined messages can have different likelihoods, and the efficiency of the coding might increase. To study the effect of different likelihoods on the encoding length in isolation, we can assume that the codings are maximally efficient and see how much additional compression the different likelihoods get us. To study code efficiency in isolation, we can assume each message is equally likely and see how much additional compression we get as we put more \(n\)-messages together. In practice, real compression involves using both techniques at once.
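For the first factor in isolation, the expected length of a maximally efficient code is given by the Shannon entropy of the message distribution. A minimal sketch, assuming a known probability for each message value:

```python
from math import log2

def entropy(probs):
    """Expected bits per message under a maximally efficient code (Shannon entropy)."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Equally likely 3-messages cost log2(3) ~ 1.585 bits each, even with a perfect code.
print(entropy([1 / 3, 1 / 3, 1 / 3]))
# Skewed likelihoods compress better in expectation:
print(entropy([0.9, 0.05, 0.05]))
```

The skewed distribution costs well under 1 bit per message in expectation, showing how different likelihoods buy compression beyond what code efficiency alone provides.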