The Burrows-Wheeler transform, revisited
In my previous post, I discussed the Burrows-Wheeler transform, and embedded an HTML form on the page which dynamically computed the forward and inverse Burrows-Wheeler transforms with an additional move-to-front compression step. My initial implementation of this form was done in pure JavaScript with no particular care for efficiency, and in particular used \(O(n^2 \log n)\) space and time. The new implementation on the same page reduces this complexity to \(O(n \log n)\) time and \(O(n)\) space, making the form usable on larger texts. It is possible, as noted in the paper of Gil and Scott, that this time complexity can be reduced to \(O(n)\) by using linear time suffix array construction algorithms.
The new implementation has been rewritten in TypeScript, which makes it much easier on the eyes. The full source code is available on GitHub.
To show off the new efficiency improvements, and the efficacy of the Burrows-Wheeler transform, we can try to compress the entire text of an English translation of Dostoevsky’s Crime and Punishment, which is made available for free online thanks to Project Gutenberg. In this case, after a few seconds of execution on my laptop, we can observe the completed transformation. A visual inspection of the BWT itself reveals long chunks of repeated characters throughout. The following tables contains statistics on the resulting compression:
Bit Count | Compression Ratio | |
---|---|---|
Original (UTF-8) | 9,434,200 | 1.00 |
MTF | 8,122,221 | 0.86 |
BWT+MTF | 3,336,189 | 0.35 |
How does the Burrows-Wheeler transform perform on texts written in a different language? In my personal experience, French text is absolutely full of extraneous characters, especially e’s, a’s, u’, and s’s, so this might suggest the Burrows-Wheeler transform to be effective in compressing French text. We ran the algorithm on Zola’s Germinal:
Bit Count | Compression Ratio | |
---|---|---|
Original (UTF-8) | 8,415,728 | 1.00 |
MTF | 7,188,574 | 0.85 |
BWT+MTF | 2,947,192 | 0.35 |
Interestingly, the compression behaved almost identically as on the English text!
We also consider Chinese text–in this case, Romance of the Three Kingdoms. The results follow:
Bit Count | Compression Ratio | |
---|---|---|
Original (UTF-8) | 14,735,232 | 1.00 |
MTF | 7,460,164 | 0.51 |
BWT+MTF | 5,229,807 | 0.35 |
Surprisingly to me, the 35% compression ratio returns, even with the enormous number of unique characters used in the Chinese text: 4225, when compared to Crime and Punishment’s 97 and Germinal’s 103.