Compression and SecurityIndex4. Build your own one-on-one compressor
Long, frequent symbol strings occur first, while short, rare ones are listed second. Spaces have been repaced by underscores for purposes of clarity. This table obeys the constraints listed above. A much larger sample dictionary is available as an appendix.
Creating dictionariesIt would be a relatively simple matter to feed in an entire "real-world" dictionary to the system proposed here. Spaces could be added automatically and variants with initial capital letters or ALL CAPITALS generated. A scheme for dealing with trailing commas and full-stops could be employed with the resulting short strings being automatically generated from the match of the word in the dictionary in some systematic manner. Frequency-analysis of the target text would help significantly. Similarly, making intelligent choices when it comes to dealing with entries whose texts are substrings of one another would also help produce improved compression ratios.
Combining with adaptive compressionUsing a fixed dictionary rarely produces the best-possible results on text messages. A combination involving first compressing using a dictionary, and then applying adaptive compression to build a symbol-table tailored to the message in question should be more effective on most types of email message. Note that using this type of fixed dictionary is *not* suitable for "compressing" attached files. Files with attachments should not be treated with this method.
Multiple passes with multiple dictionariesOne disadvantage of the method as presented so far is that constraints on the dictionary mean that there are limitations on which words may be compressed.
It is not possible to have more than one of "this ", "his " and "is " as dictionary entries, due to the "no shared substrings" condition. However, these are all common English words. Putting them in the dictionary would clearly be desirable. David Scott has proposed the use of multiple passes with multiple dictionaries to resolve this problem. To take this to extremes, imagine that instead of parsing the file looking for all dictionary entries, the file is parsed many times - once for each English text fragment in the dictionary. Each pass exchanges that single English text fragment with its matching obscure short string. Compression uses Dic[A], followed by Dic[B], and then Dic[C] ... Dic[Z]. Decompression reverses this, using Dic[Z] ... Dic[C], followed by Dic[B], and then Dic[A]. The result is clearly a new type of 1-1 compression. However no English word has to "compete" with any other in terms of the two constraints presented above. This new-found freedom improves the compression ratio. Of course, using a large number of passes executed in series will be likely to be computationally expensive and time consuming. Some compromise between looking up words in serial and in parallel will probably produce the most practical system.
Speed considerationsThe dictionary can be searched by using an initial alphabetic index, followed by a bisecting search (or better). This means that adding words to the dictionary has a very low cost in terms of search time, which scales in the manner of log(N), where N is the number of entries. Use of multiple passes with multiple dictionaries, will obviously be more expensive.
Size considerationsThe size of the compressor and decompressor will be approximately equal to the size of the dictionary. This may be large. However, it is a common property of a good compressor that it extracts data from the messages, leaving only the information behind, where the data is reintroduced at the other end. Any good compressor of English text necessarily needs a fairly large database to work from. Generally speaking, in the current climate - and for the foreseeable future - bandwidth is more likely to be a critical resource than memory for the majority of target applications.
An adaptive variationThe method described in this document uses a static dictionary. However, an adaptive variation is quite possible, and a simple adaptive variant is described here.
Reverse Parsing - a sister schemeThere's an alternative algorithm, which parses the file in reverse. This has some slightly different properties, and is described ion more detail here.
Compressing other types of fileThere are a number of reversible, one-on-one transformations in the literature that do not typically perform any compression. One notable such transform is the Fourier transform. If you believe that you data contains structural information in the frequency domain, performing a Fourier transform before compression may help extract this information during compression. In general, any reversible one-on-one transformation which reveals "structure" in your data might help.
|