Compression and Security

4. Build your own one-on-one compressor

The technique described here is a simple dictionary-based compression scheme. It works by extracting any frequent, long sub-strings and swapping them with infrequent short ones, in a manner that is guaranteed to be completely reversible.

The dictionary used in this operation is compiled in advance and may be optimised - by hand if necessary - in order to produce the best-possible characteristics for the target files.

Files are parsed from top to bottom. The compressor systematically searches for any words which match entries in the dictionary, and replaces them with their matching partners. After a replacement has been made, scanning resumes at the first character after the text that was just replaced.

As well as swapping long, frequent strings for short, infrequent ones, the reverse substitution happens simultaneously: any short, infrequent strings found in the file are expanded into their long partners. This is what makes the operation reversible.

To decompress a compressed file, exactly the same process is repeated.
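
The whole transform can be sketched in a few lines of Python. This is an illustration only, using a tiny hypothetical two-entry dictionary; a real compressor would use a large, cleaned table.

```python
# Hypothetical dictionary: long, frequent strings <=> short, rare ones.
PAIRS = {"the ": "~", "and ": "`"}

# Each entry swaps with its partner in BOTH directions, so the same
# function both compresses and decompresses.
SWAP = {**PAIRS, **{v: k for k, v in PAIRS.items()}}

def transform(text):
    """Scan left to right; on a match, emit the partner string and
    resume scanning just after the matched text."""
    out, i = [], 0
    while i < len(text):
        for src, dst in SWAP.items():
            if text.startswith(src, i):
                out.append(dst)
                i += len(src)
                break
        else:  # no dictionary entry matched at this position
            out.append(text[i])
            i += 1
    return "".join(out)
```

Applying transform() twice returns the original text, which is exactly the one-on-one property described above.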

In order to produce a compressor, two tables of words are required. Each table contains an equal number of strings, so that a bijection between them is possible. Together, the words in the two tables comprise a single dictionary.

One table contains high-frequency long sub-strings from the text. The other contains low-frequency short symbols (including sub-strings with a frequency of zero).

The entirety of both tables in the dictionary is then "cleaned" according to the following two conditions:

  • No string in the tables should contain another such string as a substring;
  • No leading symbols in any string should exactly match the trailing symbols in a different string.

Between them these conditions are both necessary and sufficient for lossless, "one-on-one" compression to occur.

Explanations of the above two conditions follow:


The "no shared substrings" condition:

Strings should never completely contain other strings present anywhere in the tables.

As examples of this condition, some word pairs which would violate it are presented in the table below.

Each dictionary entry has had any spaces replaced by underscores so they become more visible.

The short substring has been placed in brackets and underlined.

per(son_) be(cause_) s(light_) out(side_) mes(sage_)
h(old_) w(hat_) r(age_) s(end_) l(ate_)
c(over_) y(our_) h(and_) t(his_) w(here_)
t(hen_) pl(ant_) a(way_) d(raw_) mot(her_)
gou(lash_) th(ought_) p(air_) hi(story_) ti(me_)
re(cover_) re(pair_) re(play_) g(runt_) h(is_)
Shared substrings are not allowed

As you can see, the fact that each dictionary entry happens to contain a trailing space makes shared sub-strings possible only at the ends of strings.

This condition is necessary to prevent pathological cases: if "and" and "wander" were both dictionary entries - with "and" <=> "#" and "wander" <=> "&" - then the string "w#er" would decompress to "wander", which would then compress to "&" rather than back to "w#er". Forbidding shared substrings removes any chance of this happening.


The "leading strings never match trailing strings" condition:

The potential problem here is a similar one:

If "lathe" and "then" are both words - and if "lathe" <=> "#" and "then" <=> "&", then there is the possibility that the string "la&" will "decompress" to "lathen", which will then recompress to "#n" (since the file is parsed from top to bottom).

Using our example dictionary, this problem never arises - since all dictionary entries end with punctuation-style symbols, and always start with alphabetic characters.


Automated dictionary verification

This cleansing operation can be easily formalized into a program that accepts a long list of proposed strings for a dictionary, and reports any clashes between the entries.
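
The check might be sketched as follows (an illustration, not a description of any existing program); the combined entries of both tables are supplied as one list.

```python
def find_clashes(entries):
    """Report violations of the two cleansing conditions:
    ("substring", a, b) when entry a occurs inside entry b, and
    ("overlap", a, b) when a leading part of a matches a trailing
    part of a different entry b."""
    clashes = []
    for i, a in enumerate(entries):
        for j, b in enumerate(entries):
            if i == j:
                continue
            if a in b:
                clashes.append(("substring", a, b))
            for k in range(1, min(len(a), len(b))):
                if a[:k] == b[-k:]:
                    clashes.append(("overlap", a, b))
                    break
    return clashes
```

A clean dictionary is simply one for which this function returns an empty list.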

A Java applet demonstrating the check would be appropriate at this point - if one had been written.


A sample dictionary

One way to demonstrate the type of system we are talking about is to exhibit a sample dictionary.

Here is a small fragment of one we created earlier:

the_~
and_`
this_a\
where_b\
but_c\
thought_d\
are_e\
why_a$
who_b$
when_c$
you_e$
your_a%
then_b%
eat_c%
ate_d%
might_a0
right_b0
home_c0
see_d0
time_a&
cause_b&
indeed_c&
between_d&
both_a}
sometimes_b}
see_c{
what_a{
would_b{
seen_c{
take_a@
press_b@
hold_c@
test_M]
make_P]
horizontal_D]
mend_F]
degree_Q[
quantity_Z[
stuff_a|
thing_b|
A dictionary as a bijection between two lists

Long, frequent symbol strings occur first, while short, rare ones are listed second.

Spaces have been replaced by underscores for purposes of clarity.

This table obeys the constraints listed above.

A much larger sample dictionary is available as an appendix.
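
The fragment above can be modelled directly as a mapping. The following sketch, using just its first few rows, checks the bijection property: no long string and no short code may appear twice.

```python
# First rows of the sample dictionary (backslashes escaped for Python).
SAMPLE = {
    "the ": "~",
    "and ": "`",
    "this ": "a\\",
    "where ": "b\\",
    "but ": "c\\",
    "thought ": "d\\",
    "are ": "e\\",
}

# Inverting the mapping loses no entries exactly when no short code
# is reused - i.e. when the two lists form a bijection.
REVERSE = {short: long for long, short in SAMPLE.items()}
assert len(REVERSE) == len(SAMPLE)
```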


Creating dictionaries

It would be a relatively simple matter to feed in an entire "real-world" dictionary to the system proposed here. Spaces could be added automatically and variants with initial capital letters or ALL CAPITALS generated.

A scheme for dealing with trailing commas and full-stops could also be employed, with the corresponding short strings generated automatically from each word's dictionary partner in some systematic manner.
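
Such variant generation might look like the following sketch; the scheme for deriving the matching short codes is left open here.

```python
def variants(word):
    """Generate long-string variants for one dictionary word: plain,
    Capitalised and ALL-CAPITALS forms, each followed by a trailing
    space, comma or full stop."""
    forms = [word, word.capitalize(), word.upper()]
    endings = [" ", ", ", ". "]
    return [f + e for f in forms for e in endings]
```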

Frequency-analysis of the target text would help significantly.

Similarly, making intelligent choices when it comes to dealing with entries whose texts are substrings of one another would also help produce improved compression ratios.


Combining with adaptive compression

Using a fixed dictionary rarely produces the best-possible results on text messages.

A combination involving first compressing using a dictionary, and then applying adaptive compression to build a symbol-table tailored to the message in question should be more effective on most types of email message.

Note that using this type of fixed dictionary is *not* suitable for "compressing" attached files; attachments should be left untouched by this method.


Multiple passes with multiple dictionaries

One disadvantage of the method as presented so far is that constraints on the dictionary mean that there are limitations on which words may be compressed.

It is not possible to have more than one of "this ", "his " and "is " as dictionary entries, due to the "no shared substrings" condition. However, these are all common English words. Putting them in the dictionary would clearly be desirable.

David Scott has proposed the use of multiple passes with multiple dictionaries to resolve this problem.

To take this to extremes, imagine that instead of parsing the file looking for all dictionary entries, the file is parsed many times - once for each English text fragment in the dictionary. Each pass exchanges that single English text fragment with its matching obscure short string.

Compression uses Dic[A], followed by Dic[B], and then Dic[C] ... Dic[Z].

Decompression reverses this, using Dic[Z] ... Dic[C], followed by Dic[B], and then Dic[A].
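
Taken to the one-word-per-dictionary extreme, the scheme might be sketched as follows; the word/code pairs are hypothetical.

```python
# One (long, short) pair per pass; note that "this ", "his " and "is "
# can now coexist, which a single dictionary would forbid.
PASSES = [("this ", "#"), ("his ", "$"), ("is ", "%")]

def swap_pair(text, a, b):
    """One-on-one exchange of a single pair in one left-to-right pass."""
    out, i = [], 0
    while i < len(text):
        if text.startswith(a, i):
            out.append(b); i += len(a)
        elif text.startswith(b, i):
            out.append(a); i += len(b)
        else:
            out.append(text[i]); i += 1
    return "".join(out)

def compress(text):
    for a, b in PASSES:            # Dic[A], then Dic[B], then Dic[C]
        text = swap_pair(text, a, b)
    return text

def decompress(text):
    for a, b in reversed(PASSES):  # Dic[C], then Dic[B], then Dic[A]
        text = swap_pair(text, a, b)
    return text
```

Because each pass is individually reversible, running the passes in reverse order undoes the whole sequence.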

The result is clearly a new type of 1-1 compression. However no English word has to "compete" with any other in terms of the two constraints presented above. This new-found freedom improves the compression ratio.

Of course, using a large number of passes executed in series is likely to be computationally expensive and time-consuming. Some compromise between looking up words in serial and in parallel will probably produce the most practical system.


Speed considerations

The dictionary can be searched by using an initial alphabetic index, followed by a bisecting search (or better). This means that adding words to the dictionary has a very low cost in terms of search time, which scales in the manner of log(N), where N is the number of entries.
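
A lookup of this kind can be sketched using the standard library's binary search; the entries shown are a hypothetical cleaned fragment.

```python
import bisect

# A cleaned table fragment, kept sorted for binary search.
ENTRIES = sorted(["and ", "because ", "the ", "thought "])

def match_at(text, i):
    """Return the dictionary entry matching text at position i, or None.
    Because no entry contains another (the cleaned-dictionary condition),
    at most one entry can match here, and it lies immediately below the
    insertion point located by bisect."""
    chunk = text[i:]
    j = bisect.bisect_right(ENTRIES, chunk)
    if j and chunk.startswith(ENTRIES[j - 1]):
        return ENTRIES[j - 1]
    return None
```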

Use of multiple passes with multiple dictionaries will obviously be more expensive.


Size considerations

The size of the compressor and decompressor will be approximately equal to the size of the dictionary. This may be large.

However, it is a common property of a good compressor that it strips redundant data from messages, leaving only the information behind; the redundancy is then reintroduced at the other end. Any good compressor of English text necessarily needs a fairly large database to work from.

Generally speaking, in the current climate - and for the foreseeable future - bandwidth is more likely to be a critical resource than memory for the majority of target applications.


An adaptive variation

The method described in this document uses a static dictionary.

However, an adaptive variation is quite possible, and a simple adaptive variant is described here.


Reverse Parsing - a sister scheme

There's an alternative algorithm, which parses the file in reverse. This has some slightly different properties, and is described in more detail here.


Compressing other types of file

There are a number of reversible, one-on-one transformations in the literature that do not typically perform any compression.

One notable such transform is the Fourier transform.

If you believe that your data contains structural information in the frequency domain, performing a Fourier transform before compression may help extract this information during compression.

In general, any reversible one-on-one transformation which reveals "structure" in your data might help.


Index | Links

tim@tt1.org | http://mandala.co.uk/