# The Burrows-Wheeler transform and move-to-front compression

The above form implements algorithms to code and decode from the move-to-front compression scheme, with or without an additional application of the Burrows-Wheeler transform. These extremely slick algorithms blew my mind when I first heard of them. They are currently in use, among other places, in bzip2 file compression.

Move-to-front is a natural and elegant compression scheme for encoding streams of data into a binary sequence, which performs particularly well when its input has many repetitions. In general, move-to-front is optimal in the sense that the length of its output sequence asymptotically attains at most a constant factor of the empirical entropy of the input stream.

We first suppose that our input stream has characters drawn from the
alphabet \(\Sigma\); for binary inputs, \(\Sigma = {0, 1}\). If the
encoder and decoder were both aware of the alphabet maintained in a
fixed and agreed upon order, then we could encode the input stream by
simply sending the stream of indices in the alphabet corresponding to
each character. For example, with the alphabet \(\Sigma = (a, b, c,
\dots, z)\), we could represent the message `mississippi`

with the
sequence of indices ```
13, 9, 19, 19, 9, 19, 19, 9, 16, 16,
9
```

.

Move-to-front makes the simple change over this trivial scheme that
each time a character is polled, it is moved to the front of the
alphabet. For example, in the first step of encoding `mississippi`

,
the alphabet \(\Sigma\) is scanned from left to right until the symbol
`m`

is encountered in the 13-th index, so we write `13`

and move `m`

to the front of the alphabet, which now becomes \(\Sigma = (m, a, b, c,
\dots, l, n, \dots, z)\). Each step of encoding `mississippi`

is
detailed below:

`m`

index`13`

, \(\Sigma = (m, a, b, \dots, z)\).`i`

index`10`

, \(\Sigma = (i, m, a, b, \dots, z)\).`s`

index`19`

, \(\Sigma = (s, i, m, a, b, \dots, z)\).`s`

index`1`

, \(\Sigma = (s, i, m, a, b, \dots, z)\).`i`

index`2`

, \(\Sigma = (i, s, m, a, b, \dots, z)\).`s`

index`2`

, \(\Sigma = (s, i, m, a, b, \dots, z)\).`s`

index`1`

, \(\Sigma = (s, i, m, a, b, \dots, z)\).`i`

index`1`

, \(\Sigma = (i, s, m, a, b, \dots, z)\).`p`

index`17`

, \(\Sigma = (p, i, s, m, a, b, \dots, z)\).`p`

index`1`

, \(\Sigma = (p, i, s, m, a, b, \dots, z)\).`i`

index`2`

, \(\Sigma = (i, p, s, m, a, b, \dots, z)\).

With move-to-front, `mississippi`

is thus encoded with the sequence
`13, 10, 19, 1, 2, 2, 1, 1, 17, 1, 2`

. In comparison with the trivial
scheme, this has drastically reduced the magnitude of frequently used
characters.

We still have to encode the integers in this index sequence. When we know that these indices are bounded within some range, as they are in this case between \(1\) and \(|\Sigma|\), it is common to use a fixed-width encoding; in computing terms, each character is represented using a fixed number of bits or bytes. However, in order to take advantage of the magnitude reduction of indices in move-to-front, it is judicious to make use of a variable-length encoding. We focus here on Elias codes.

The Elias codes are each prefix-free codes of natural numbers. A code
is prefix-free if no codeword is the prefix of another. Note that it
is possible to decode a sequence of codewords from a prefix-free code
immediately once each new codeword is encountered. The simplest
prefix-free code for natural numbers is the *unary code*, where the
integer `i`

is encoded using a sequence of \(i - 1\) `0`

characters,
followed by a single `1`

. This last `1`

is necessary to delimit
codewords in order to make the code prefix-free. It’s obvious that the
unary codeword for \(i\) has length \(i\).

The Elias \(\gamma\)-code is another prefix-free code of natural
numbers. Observe first that the binary codeword for the integer `i`

has length \(\lfloor \log_2 i \rfloor + 1\) bits, and the
most-significant bit here is always a `1`

. This by itself is not a
prefix-free code. The Elias \(\gamma\)-code prepends a sequence of
\(\lfloor \log_2 i \rfloor\) `0`

bits to the binary codeword for `i`

,
thereby creating a prefix-free code of length \(2 \lfloor \log_2 i
\rfloor + 1\) for the integer `i`

.

In principle, the Elias \(\gamma\)-code presents the number \(\lfloor
\log_2 i \rfloor + 1\) in a unary encoding through the most
significant bits of the codeword for `i`

, which tells us the remaining
bits to be read in order to guarantee prefix-freeness. If we encode
this integer instead with an Elias \(\gamma\)-code itself instead of
unary, then we obtain the Elias \(\delta\)-code. For the integer \(i\),
the Elias \(\delta\)-code has length \(\log_2 i + O(\log \log i)\). It’s
actually possible to iterate this process as much as possible to
obtain the Elias \(\omega\)-code, but the Elias \(\delta\)-code is
powerful enough for most purposes that the \(\omega\)-code is rarely
used, even in theoretical works.

You can try to use the form at the top of this post to encode
`mississippi`

in move-to-front with Elias \(\delta\)-codes, by typing it
in the input tab and switching to the MTF tab.

There is one simple and reversible transformation which we can make to
the input in order to greatly improve move-to-front compression
performance on highly structured inputs: this is the Burrows-Wheeler
transform. Simply
put, the Burrows-Wheeler transform of an input string `str`

takes all
all cyclic rotations of the string `str + $`

, where `$`

is an
otherwise unused character, and sorts these rotations in lexicographic
order, taking the last character of each of these sorted
rotations.

It’s not hard to see why this transform can be inverted. But why should it help in move-to-front compression for structured input? As stated on Wikipedia,

To understand why this creates more-easily-compressible data, consider transforming a long English text frequently containing the word “the”. Sorting the rotations of this text will group rotations starting with “he “ together, and the last character of that rotation (which is also the character before the “he “) will usually be “t”, so the result of the transform would contain a number of “t” characters along with the perhaps less-common exceptions (such as if it contains “Brahe “) mixed in. So it can be seen that the success of this transform depends upon one value having a high probability of occurring before a sequence, so that in general it needs fairly long samples (a few kilobytes at least) of appropriate data (such as text).

There are bijective variants of the Burrows-Wheeler transform, which don’t require the introduction of a new character at the end of the input; I implemented the variant of Gil and Scott in the form at the top of this post.

So how well does this perform? The text above this paragraph, encoded with move-to-front by itself, takes 42260 bits. With the addition of the Burrows-Wheeler transform, this is brought all the way down to 21452 bits, almost half as many!