Results 1 to 3 of 3
  1. #1
    sara.quin is offline Member
    Join Date
    Mar 2016
    Posts
    1
    Rep Power
    0

    Default Entropy of a dictionary

    Hi, I have an english dictionary (a file that contains a list of words) and I want to calculate:

    1. given a path tree (a word), measure H(Cl+1|Cl=cl, Cl-1=cl-1, ...) for a few levels of the tree
    2. for each level of the tree calculate H(Cl+1|Cl, ...)
    3. using the tree model calculate the entropy of the dictionary, measured in bits per word.

    I built a trie data structure to represent the dictionary.
    So I have something like: http://i.stack.imgur.com/jqUVi.png


    I know that:
    H(X) = - sum_{i} P_i log2 Pi
    H(X|Y) = - sum_y P(y) H(X|Y=y)
    H(X1,X2,..,Xn|Y) = sum_{i=1}^n H(Xi|X1,...,Xi-1,Y)


    This is what I think to do:

    - (1) I think that is a conditioning of a single event. I have to calculate what is the entropy in the next letter I suppose to know the current one. How can I do?
    - (2) Now I have the conditioning of a random variable, not on a specific event. Again I don't know how to solve it.
    - (3) I know H(D) = H(C1C2...Cn)=sum{l=1}H(Cl|Cl-1Cl-1...C1)=H(C1)+H(C2|C1)+H(C3|C2C1)+...+H(C_l|C_l-1...C1)
    But in practice I don't know how to do..

    My classes are:
    Java Code:
    class TrieNode {
    
        public char content; 
        public boolean isEnd; 
        public double count; 
        public LinkedList<TrieNode> childList; 
        public static int id = 0;
    
        public TrieNode(char c) {
            childList = new LinkedList<TrieNode>();
            isEnd = false;
            content = c;
            count = 0;
        }
    
        public TrieNode subNode(char c) {
            if(childList != null)
                for(TrieNode eachChild : childList)
                    if(eachChild.content == c)
                        return eachChild;
            return null;
        }
    
        public void printNode() {
            String s = "";
            s += "{" + content + ", " + isEnd + ", " + count + ", [ ";
            for(TrieNode eachChild : childList) {
                s += eachChild.content + " ";
            }
            s += "]";
            System.out.println(s);
        }
    }

    Java Code:
    public class Trie {
    
    	final static String PATH_DICT = "./ENdictionary.ngl"; 
    	final static String OUTPUT_DICT = "./ENdictionaryClear.ngl"; 
    	public static FileWriter outFile;
    	public static PrintWriter writer;
    	public static PrintWriter out;
    	public double totChar = 0; 
    
    	private TrieNode root;
    	private double totWords;
     
    	public Trie() {
    		root = new TrieNode('*'); //blank for root 
    		totWords = 0;
    	}
    
    	public TrieNode getRoot() {
    		return root;
    	}
    
    	public double getTotWords() {
    		return totWords;
    	}
    
    	/** 
    	 * Function to insert word.
    	 */
    	public void insert(String word) {
    		if(search(word) == true)
    			return;
    		totChar += word.length(); 
    		TrieNode current = root;
    		for(char ch : word.toCharArray()) {
    			TrieNode child = current.subNode(ch); 
    			if(child != null) {
    				current = child;
    			}
    			else {
    				current.childList.add(new TrieNode(ch));
    				current = current.subNode(ch);
    			}
    			current.count++;
    		}
    		current.isEnd = true;
    	}
    	
    	/**
    	 * Function to search for word.
    	 */
    	public boolean search(String word) {
    		TrieNode current = root;  
    		for(char ch : word.toCharArray()) {
    			if(current.subNode(ch) == null)
    				return false;
    			else
    				current = current.subNode(ch);
    		}      
    		if(current.isEnd == true) 
    			return true;
    		return false;
    	}
    
    	/**
    	 * Function to remove a word.
    	 */
    	public void remove(String word) {
    		if(search(word) == false) {
    			System.out.println(word + " does not exist in trie.\n");
    			return;
    		}             
    		TrieNode current = root;
    		for(char ch : word.toCharArray()) { 
    			TrieNode child = current.subNode(ch);
    			if(child.count == 1) {
    				current.childList.remove(child);
    				return;
    			} 
    			else {
    				child.count--;
    				current = child;
    			}
    		}
    		current.isEnd = false;
    	}
    
    	/**
    	 * Create dictionary data structure (Trie) from the dictionary file.
    	 */
    	public void createDictionary() {
    		Pattern p = Pattern.compile("[^a-z]", Pattern.CASE_INSENSITIVE);
    		Matcher m;
    		try {
    			outFile = new FileWriter(OUTPUT_DICT, false); //per il txt
    			writer = new PrintWriter(outFile);
    			FileInputStream fstream = new FileInputStream(PATH_DICT); //open the file
    			BufferedReader br = new BufferedReader(new InputStreamReader(fstream));
    			String word;
    			while((word = br.readLine()) != null) { //read File Line By Line
    				m = p.matcher(word);
    				if(!m.find()) { //se la stringa word contiene un numero o un carattere speciale non la considero
    					writer.println(word); //stampo su file il nuovo dizionario
    					totWords++;
    					insert(word); //creo il trie del dizionario
    				}
    			}
    			br.close(); //close the input stream
    		}
    		catch(IOException ioe) {
    			System.out.println(ioe);
    		}
    	}
    
    	public double entropyFirstChar() {
    		double[] p = new double[getRoot().childList.size()];
    		int i = 0; 
    		double totWords = getTotWords();
    		double entropy = 0.0;
    		for(TrieNode node : root.childList) { //per ogni figlio di *
    			p[i] = node.count / totWords;
    			i++;
    		}
    		for(int j = 0; j < p.length; j++) {
    			if(p[j] != 0) {
    				entropy -= p[j] * log2(p[j]);
    			}
    		}
    		return entropy;
    	}
    
    	public double entropySingleChar() {
    		double entropy = 0.0; 
    		double[] p = new double[Alphabet.values().length]; 
    		int i = 0;
    		for(Alphabet character : Alphabet.values()) { 
    			double occ = occurrencesOfChar(root, character.getC());
    			p[i] = occ / totChar;
    			i++;
    		}
    		for(int j = 0; j < p.length; j++) {
    			if(p[j] != 0) {
    				entropy -= p[j] * log2(p[j]);
    			}
    		}
    		return entropy;
    	}
    
    	public double occurrencesOfChar(TrieNode node, char c) {
    		double occ = 0;
    		for(TrieNode child : node.childList) { 
    			if(child.content == c) { 
    				occ += child.count; 
    			}
    			occ += occurrencesOfChar(child, c);
    		}
    		return occ;
    	}
    
    	public static double log2(double n) {
        	    return (Math.log(n) / Math.log(2));
    	}
    
    }

    Thanks!

  2. #2
    Norm's Avatar
    Norm is offline Moderator
    Join Date
    Jun 2008
    Location
    Eastern Florida
    Posts
    20,003
    Rep Power
    33

    Default Re: Entropy of a dictionary

    If you don't understand my response, don't ignore it, ask a question.

  3. #3
    SurfMan's Avatar
    SurfMan is offline Godlike
    Join Date
    Nov 2012
    Location
    The Netherlands
    Posts
    1,993
    Rep Power
    9

    Default Re: Entropy of a dictionary

    "It's not fixed until you stop calling the problem weird and you understand what was wrong." - gimbal2 2013

Similar Threads

  1. Referencing a dictionary
    By DarkD in forum Forum Lobby
    Replies: 1
    Last Post: 03-10-2014, 12:08 AM
  2. Using XML as a dictionary for NLP project
    By MiamiCane99 in forum XML
    Replies: 0
    Last Post: 02-17-2012, 09:32 PM
  3. I need a little direction for a dictionary app
    By Alexis in forum AWT / Swing
    Replies: 5
    Last Post: 02-11-2011, 08:43 PM
  4. dictionary
    By aizen92 in forum New To Java
    Replies: 49
    Last Post: 01-01-2011, 10:07 AM
  5. add dictionary
    By monir6464 in forum New To Java
    Replies: 2
    Last Post: 04-07-2008, 07:27 PM

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •