Lempel-Ziv-Welch algorithm

tags
Compression, Complexity
papers
(Lempel, Ziv 1976; Ziv, Lempel 1977; Ziv, Lempel 1978; Welch 1984; Storer, Szymanski 1982)

Context

The LZW algorithm was originally designed as a complexity (“randomness”) metric for finite sequences (Lempel, Ziv 1976). It was then extended as a compression algorithm by the same authors to LZ77 (Ziv, Lempel 1977) and LZ78 (Ziv, Lempel 1978). Those last two are the basis of many well known and widely used compression utilities such as GIF, compress (LZW (Welch 1984) ) or DEFLATE, gzip (LZSS (Storer, Szymanski 1982)), etc.

Description

The base idea of these algorithms is to build a dictionary of words in the input sequence. A given sequence is scanned and an empty dictionary is initialized. Each new word is the shortest next word not in the dictionary. It is registered by using a previous word and adding a single digit in the form \((j, s_{last})\). This encoding allows a unique decoding of the original sequence. When encoding or decoding the string the same dictionary of words is built. For sequences with string repetitions, the length of each new word increases fast and the numbers of needed pairs to encode increases very slowly.

Example implementation of the LZ algorithm in Python:

def compress(string):
    dic = {'0': 0, '1': 1}
    word = ""
    result = []
    for c in string:
        wc = word + c
        if wc in dic:
            word = wc
        else:
            result.append(dic[word])
            dic[wc] = len(dic)
            word = c
    if word:
        result.append(dic[word])
        return result

print(compress("1101010101011101010001110100"))
print(compress("1010101010101010101010101010"))

Results:

[1, 1, 0, 3, 5, 4, 4, 2, 7, 3, 0, 8, 6, 0]
[1, 0, 2, 4, 3, 6, 5, 8, 7, 6]

Bibliography

  1. . . "On the Complexity of Finite Sequences". IEEE Transactions on Information Theory 22 (1):75–81. DOI.
  2. . . "A Universal Algorithm for Sequential Data Compression". IEEE Transactions on Information Theory 23 (3):337–43. DOI.
  3. . . "Compression of Individual Sequences via Variable-rate Coding". IEEE Transactions on Information Theory 24 (5):530–36. DOI.
  4. . . "A Technique for High-performance Data Compression". Computer 17 (6):8–19. DOI.
  5. . . "Data Compression via Textual Substitution". Journal of the ACM (JACM) 29 (4):928–51. DOI.

Links to this note

Last changed | authored by

Comments


← Back to Notes