# Thread: Entropy of a dictionary

1. Member Join Date
Mar 2016
Posts
1
Rep Power
0

## Entropy of a dictionary

Hi, I have an english dictionary (a file that contains a list of words) and I want to calculate:

1. given a path tree (a word), measure H(Cl+1|Cl=cl, Cl-1=cl-1, ...) for a few levels of the tree
2. for each level of the tree calculate H(Cl+1|Cl, ...)
3. using the tree model calculate the entropy of the dictionary, measured in bits per word.

I built a trie data structure to represent the dictionary.
So I have something like: http://i.stack.imgur.com/jqUVi.png

I know that:
H(X) = - sum_{i} P_i log2 Pi
H(X|Y) = - sum_y P(y) H(X|Y=y)
H(X1,X2,..,Xn|Y) = sum_{i=1}^n H(Xi|X1,...,Xi-1,Y)

This is what I think to do:

- (1) I think that is a conditioning of a single event. I have to calculate what is the entropy in the next letter I suppose to know the current one. How can I do?
- (2) Now I have the conditioning of a random variable, not on a specific event. Again I don't know how to solve it.
- (3) I know H(D) = H(C1C2...Cn)=sum{l=1}H(Cl|Cl-1Cl-1...C1)=H(C1)+H(C2|C1)+H(C3|C2C1)+...+H(C_l|C_l-1...C1)
But in practice I don't know how to do..

My classes are:
Java Code:
```class TrieNode {

public char content;
public boolean isEnd;
public double count;
public static int id = 0;

public TrieNode(char c) {
isEnd = false;
content = c;
count = 0;
}

public TrieNode subNode(char c) {
if(childList != null)
for(TrieNode eachChild : childList)
if(eachChild.content == c)
return eachChild;
return null;
}

public void printNode() {
String s = "";
s += "{" + content + ", " + isEnd + ", " + count + ", [ ";
for(TrieNode eachChild : childList) {
s += eachChild.content + " ";
}
s += "]";
System.out.println(s);
}
}```

Java Code:
```public class Trie {

final static String PATH_DICT = "./ENdictionary.ngl";
final static String OUTPUT_DICT = "./ENdictionaryClear.ngl";
public static FileWriter outFile;
public static PrintWriter writer;
public static PrintWriter out;
public double totChar = 0;

private TrieNode root;
private double totWords;

public Trie() {
root = new TrieNode('*'); //blank for root
totWords = 0;
}

public TrieNode getRoot() {
return root;
}

public double getTotWords() {
}

/**
* Function to insert word.
*/
public void insert(String word) {
if(search(word) == true)
return;
totChar += word.length();
TrieNode current = root;
for(char ch : word.toCharArray()) {
TrieNode child = current.subNode(ch);
if(child != null) {
current = child;
}
else {
current = current.subNode(ch);
}
current.count++;
}
current.isEnd = true;
}

/**
* Function to search for word.
*/
public boolean search(String word) {
TrieNode current = root;
for(char ch : word.toCharArray()) {
if(current.subNode(ch) == null)
return false;
else
current = current.subNode(ch);
}
if(current.isEnd == true)
return true;
return false;
}

/**
* Function to remove a word.
*/
public void remove(String word) {
if(search(word) == false) {
System.out.println(word + " does not exist in trie.\n");
return;
}
TrieNode current = root;
for(char ch : word.toCharArray()) {
TrieNode child = current.subNode(ch);
if(child.count == 1) {
current.childList.remove(child);
return;
}
else {
child.count--;
current = child;
}
}
current.isEnd = false;
}

/**
* Create dictionary data structure (Trie) from the dictionary file.
*/
public void createDictionary() {
Pattern p = Pattern.compile("[^a-z]", Pattern.CASE_INSENSITIVE);
Matcher m;
try {
outFile = new FileWriter(OUTPUT_DICT, false); //per il txt
writer = new PrintWriter(outFile);
FileInputStream fstream = new FileInputStream(PATH_DICT); //open the file
String word;
m = p.matcher(word);
if(!m.find()) { //se la stringa word contiene un numero o un carattere speciale non la considero
writer.println(word); //stampo su file il nuovo dizionario
totWords++;
insert(word); //creo il trie del dizionario
}
}
br.close(); //close the input stream
}
catch(IOException ioe) {
System.out.println(ioe);
}
}

public double entropyFirstChar() {
double[] p = new double[getRoot().childList.size()];
int i = 0;
double totWords = getTotWords();
double entropy = 0.0;
for(TrieNode node : root.childList) { //per ogni figlio di *
p[i] = node.count / totWords;
i++;
}
for(int j = 0; j < p.length; j++) {
if(p[j] != 0) {
entropy -= p[j] * log2(p[j]);
}
}
return entropy;
}

public double entropySingleChar() {
double entropy = 0.0;
double[] p = new double[Alphabet.values().length];
int i = 0;
for(Alphabet character : Alphabet.values()) {
double occ = occurrencesOfChar(root, character.getC());
p[i] = occ / totChar;
i++;
}
for(int j = 0; j < p.length; j++) {
if(p[j] != 0) {
entropy -= p[j] * log2(p[j]);
}
}
return entropy;
}

public double occurrencesOfChar(TrieNode node, char c) {
double occ = 0;
for(TrieNode child : node.childList) {
if(child.content == c) {
occ += child.count;
}
occ += occurrencesOfChar(child, c);
}
return occ;
}

public static double log2(double n) {
return (Math.log(n) / Math.log(2));
}

}```

Thanks!  Reply With Quote

2. ## Re: Entropy of a dictionary  Reply With Quote

3. ## Re: Entropy of a dictionary  Reply With Quote

#### Posting Permissions

• You may not post new threads
• You may not post replies
• You may not post attachments
• You may not edit your posts
•