Introduction

This project revolves around an important text processing task, text compression. In particular, you will be required to encode a sequence of words read from a source file into binary strings (using only the characters 0 and 1). It is important to note that text compression makes it possible to minimize the time needed to transmit text over a low-bandwidth channel, such as infrared connection. Moreover, text compression is helpful in storing large documents more efficiently. The coding scheme explored in this project is the Huffman Coding Scheme. While standard encoding schemes, such as Unicode and ASCII, use fixed-length binary strings to encode characters, Huffman coding assigns variable-length codes to characters. The length of a Huffman code depends on the relative frequency of its associated character. Specifically, Huffman coding capitalizes on the fact that some characters are used more frequently than others to use short codewords when encoding high-frequency characters and long codewords to encode low-frequency characters. Huffman coding saves space over state of the art fixed-length encoding and is therefore at the heart of file compression techniques in common use today. Figure 1 shows the relative frequencies of the letters of the alphabet as they appear in a representative sample of English documents.

Letter Frequency Letter Frequency

A 77 N 67

B 17 O 67

C 32 P 20

D 42 Q 5

E 120 R 59

F 24 S 67

G 17 T 85

H 50 U 37

I 76 V 12

J 4 W 22

K 7 X 4

L 42 Y 22

M 24 Z 2

Figure 1. Relative frequencies for the 26 letters of the alphabet.

Huffman coding and decoding

Huffman’s algorithm for producing optimal variable-length codes is based on the construction of a binary tree T that represents the code. In other words, the Huffman code for each character is derived from a full binary tree known as the Huffman coding tree, or simply the Huffman tree. Each edge in the Huffman tree represents a bit in a codeword, with each edge connecting a node with its left child representing a “0” and each edge connecting a node with its right child representing a “1”. Each external node in the tree is associated with a specific character, and the Huffman code for a character is defined by the sequence of bits in the path from the root to the leaf corresponding to that character. Given codes for the characters, it is a simple matter to use these codes to encode a text message. You will have simply to replace each letter in the string with its binary code (a lookup table can be used for this purpose).

In this project, you are not going to use the table given in Figure 1 to determine the frequency of occurrence per character. Instead, you will derive the frequency corresponding to a character by counting the number of times that character appears in an input file. For example, if the input file contains the following line of text “a fast runner need never be afraid of the dark”, then the frequencies listed in the table given in Figure 2 should be used per character:

Character a b d e f H i k n O r s t u v

Frequency 9 5 1 3 7 3 1 1 1 4 1 5 1 2 1 1

Figure 2. The frequency of each character of the String X.

Based on the frequencies shown in Figure 2, the Huffman tree depicted in Figure 3 can be constructed:

Figure 3. Huffman tree for String X.

The code for a character is thus obtained by tracing the path from the root of the Huffman tree to the external node where that character is stored, and associating a left edge with 0 and a right edge with 1. In the context of the considered example for instance, the code for “a” is 010, and the code for “f” is 1100.

Once the Huffman tree is constructed and the message obtained from the input file has been encoded, decoding the message is done by looking at the bits in the coded string from left to right until all characters are decoded. This can be done by using the Huffman tree in a reverse process from that used to generate the codes. Decoding the bit string begins at the root of the tree. Branches are taken depending on the bit value – left for ‘0’ and right for ‘1’ – until reaching a leaf node. This leaf contains the first character in the message. The next bit in the code is then processed from the root again to start the next character. The process is repeated until all the remaining characters are decoded. For example, to decode the string “0101100” in the case of the example under study, you begin at the root of the tree and take a left branch for the first bit which is ‘0’. Since the next bit is a ‘1’, you take a right branch. Then, you take a left branch (for the third bit ‘1’), arriving at the leaf node corresponding to the letter a. Thus, the first letter of the coded word is a. You then begin again at the root of the tree to process the fourth bit, which is a ‘1’. Taking 2 right branches then two left branches, you reach the leaf node corresponding to the letter f.

Problem statement

You are required to implement the Huffman coding/decoding algorithms. After you complete the implementation of the coding/decoding processes, you are asked to use the resulting Java code to:

1. Read through a source file called “in1.dat” that contains the following paragraph:

“the Huffman coding algorithm views each of the d distinct characters of the string X as being in separate Huffman trees initially with each tree composed of a single leaf node these separate trees will eventually be joined into a single Huffman tree in each round the algorithm takes the two binary trees with the smallest frequencies and merges them into a single binary tree it repeats this process until only one tree is left.”

2. Determine the actual frequencies for all the letters in the file.

3. Use the frequencies from the previous step to create a Huffman coding tree before you assign codes to individual letters. Use the LinkedBinaryTree class that we developed in class to realize your Huffman coding trees.

4. Produce an encoded version of the input file “in1.dat” and then store it in an output file called “out.dat”.

5. Finally, the decoding algorithm will come into play to decipher the codes contained in “out.dat”. The resulting decoded message should be written to an output file called “in2.dat”. If nothing goes wrong, the text stored in “in1.dat” and the one in “in2.dat” must match up correctly.