  1. #1
    Dodo is offline Member
    Join Date
    Nov 2009
    Posts
    10
    Rep Power
    0

    Word occurrence counter for any web page

    Does anybody have sample source code for the following problem? I don't know how to build such an application. I would like to create a program that crawls any English web page (given by its URL) and shows me the 10 most frequent words on that page. The program should follow links down to the fifth level (at most) from the starting page.

    I suppose I would use classes (e.g. URI) from:

    import java.io.*;
    import java.net.*;
    import java.util.*;
    import javax.swing.text.*;
    import javax.swing.text.html.*;
    import javax.swing.text.html.parser.*;

    Furthermore, I should use:

    HTMLEditorKit.ParserCallback
    HashSet<E>
    LinkedList<E>

    and others...

    Thanks for any help.

    Dodo

  2. #2
    Eranga is offline Moderator
    Join Date
    Jul 2007
    Location
    Colombo, Sri Lanka
    Posts
    11,372
    Blog Entries
    1
    Rep Power
    20

    It's better to break your requirement down into smaller steps first, I guess.

    First of all, think about how to read the HTML file. You are going to use a URL to locate the file, so the simplest thing is to download the web page temporarily to the local machine, then read the file and collect its content. Depending on the size of the file, you may take a different approach.

    Keep in mind not to read the entire file into memory at once. What I normally do is use the Iterator interface and read line by line. A BufferedReader is fine for that.
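
    A minimal sketch of the line-by-line reading described above. The class name LineByLineSketch and the method countLines are made up for illustration, and UTF-8 is an assumed page encoding; a real page may declare a different one.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the code posted in this thread.
class LineByLineSketch {

    // Reads the source one line at a time, so the whole document is never
    // held in memory at once, and returns how many lines were seen.
    static int countLines(Reader source) {
        int lines = 0;
        try (BufferedReader reader = new BufferedReader(source)) {
            while (reader.readLine() != null) {
                lines++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // With a URL argument the page is streamed from the network;
        // without one, a small in-memory document is used instead.
        Reader source = args.length > 0
                ? new InputStreamReader(URI.create(args[0]).toURL().openStream(), StandardCharsets.UTF_8)
                : new StringReader("<html>\n<body>Hello</body>\n</html>");
        System.out.println("Lines read: " + countLines(source));
    }
}
```

    The try-with-resources block closes the reader even when reading fails, which the explicit reader.close() call in a plain try block would not do.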

    Then think about how to implement the search.

    Hope that helps.

  3. #3
    Dodo is offline Member

    Hello,

    I have sketched the structure of my source code to make the problem easier to understand, but I don't know how to implement some of the methods, as marked in the comments below. Can anybody help me solve this problem?

    Java Code:
    import java.io.*;
    import java.net.*;
    import java.util.*;
    import javax.swing.text.*;
    import javax.swing.text.html.*;
    import javax.swing.text.html.parser.*;
    
    class URIinfo {
    	
    	URI uri;
    	int depth;
    	
    	URIinfo(URI uri, int depth) {
    		this.uri = uri;
    		this.depth = depth;
    	}
    	
    	URIinfo(String str, int depth) {
    		
    		try {
    			this.uri = new URI(str);
    		} catch (URISyntaxException e) {
    			e.printStackTrace();
    		}
    		
    		this.depth = depth;
    	}
    }
    
    class ParserCallback extends HTMLEditorKit.ParserCallback {
    	
    	URI pageURI;
    	
    	int depth = 0, maxDepth = 5;
    		
    	HashSet<URI> visitedURIs;
    	LinkedList<URIinfo> foundURIs;
    	
    	int debugLevel = 0;
    	
    	ParserCallback (HashSet<URI> visitedURIs, LinkedList<URIinfo> foundURIs) {
    		this.foundURIs = foundURIs;
    		this.visitedURIs = visitedURIs;
    	}
    	
    	public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
    		handleStartTag(t, a, pos);
    	}
    	
    	public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
    		URI uri;
    		String href = null;
    		
    		if (debugLevel > 1)
    			System.err.println("handleStartTag: " + t.toString() + ", pos=" + pos + ", attribs=" + a.toString());
    		
    		if (depth <= maxDepth) {
    			if (t == HTML.Tag.A) {
    				href = (String) a.getAttribute(HTML.Attribute.HREF);
    			} else if (t == HTML.Tag.FRAME) {
    				href = (String) a.getAttribute(HTML.Attribute.SRC);
    			}
    		}
    		
    		if (href != null) {
    			try {
    				uri = pageURI.resolve(href);
    				
    				if (!uri.isOpaque() && !visitedURIs.contains(uri)) {
    					visitedURIs.add(uri);
    					foundURIs.add(new URIinfo(uri, depth + 1));
    					
    					if (debugLevel > 0)
    						System.err.println("Adding URI: " + uri.toString());
    				}
    			} catch (Exception e) {
    				System.err.println("Found incorrect URI: " + href);
    				e.printStackTrace();
    			}
    		}
    	}
    	
    	
         public void handleText(char[] data, int pos) {
    		 System.out.println("handleText: " + String.valueOf(data) + ", pos = " + pos);
    		 
    		 /*
    		  * What should I implement here?
    		  */
         }
    }
    
    public class Parser {
    	
    	public static void main(String[] args) {
    		
    		URI uri;
    		HashSet<URI> visitedURIs = new HashSet<URI>();
    		LinkedList<URIinfo> foundURIs = new LinkedList<URIinfo>();
    		
    		try {
    			uri = new URI(args[0] + "/");
    			foundURIs.add(new URIinfo(uri, 0));
    			visitedURIs.add(uri);
    			
    			if (args.length < 1) {
    				System.err.println("Missing parameter - start URL");
    				return;
    			}
    			
    			/*
    			 * How should I implement maxDepth and debugLevel... here?
    			 */
    			
    			ParserCallback callBack = new ParserCallback(visitedURIs, foundURIs);
    			ParserDelegator parser = new ParserDelegator();
    			
    			while (!foundURIs.isEmpty()) {
    				URIinfo URIinfo = foundURIs.removeFirst();
    				callBack.depth = URIinfo.depth;
    				callBack.pageURI = uri = URIinfo.uri;
    				System.err.println("Analyzing " + uri);
    				
    				try {
    					BufferedReader reader = new BufferedReader(new InputStreamReader(uri.toURL().openStream()));
    					parser.parse(reader, callBack, true);
    					reader.close();
    				} catch (FileNotFoundException e) {
    					System.err.println("Error loading page - does it exist?");
    				}
    			}
    		} catch (Exception e) {
    			System.err.println("Exception, Exit!");
    			e.printStackTrace();
    		}
    	}
    }
    Thanks, Dodo
    Last edited by Eranga; 11-05-2009 at 04:41 AM. Reason: code tags added.

  4. #4
    Eranga is offline Moderator

  5. #5
    Dodo is offline Member

    Dear Eranga,

    thank you for reformatting my source code; it really looks better. I forgot to do it yesterday, sorry :)

    I'm a university student and a beginner at programming in Java. My homework is to complete certain parts of the code above. Do you know how to solve it, or could you point me in the right direction?

    Have a nice day.

    Michal

  6. #6
    Eranga is offline Moderator

    Yes, I can help you. If you ask your question more specifically, it's much easier for us, Michal. Going through the code and fixing things here and there doesn't make sense for either of us, I guess.

    Where are you stuck?

  7. #7
    Dodo is offline Member

    Dear Eranga,

    sorry for my badly formulated question. I hope this is an improvement :)

    I would like to clarify what I really meant. As you know, I'm a beginner at Java programming. The method "public void handleText(char[] data, int pos)" in the source code I posted in post #3 has an empty body. I don't know how to write a body that prints the 20 most frequent words (a word counter) found on the parsed HTML page. The page is given as a URI or URL argument on the command line, together with the optional arguments "maxDepth" (how many levels of links to follow, 0 to 5) and "debugLevel" (how much is printed to the standard error output). The "maxDepth" and "debugLevel" arguments should be handled in the "main" method of the "Parser" class. There are just two places where I need to complete the implementation; both are marked with comments.

    I hope you understand my problem now. If you know a solution, please let me know. Thank you.

    Best regards,

    Michal

  8. #8
    Eranga is offline Moderator

    First of all, this part of the Parser class should be corrected.

    Java Code:
                uri = new URI(args[0] + "/");
                foundURIs.add(new URIinfo(uri, 0));
                visitedURIs.add(uri);
    
                if (args.length < 1) {
                    System.err.println("Missing parameter - start URL");
                    return;
                }
    Validation should be done before the resource is used. Here you refer to the argument without testing whether it exists, and only afterwards test it in the if condition.
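
    One way to order the checks, as a sketch only. The class ArgsSketch, the Settings holder, and the positional layout (start URL, then optional maxDepth, then optional debugLevel) are assumptions for illustration. The defaults 5 and 0 mirror the field initializers in ParserCallback, and the 0 to 5 range comes from the requirement stated earlier in the thread.

```java
import java.net.URI;

// Hypothetical helper, not part of the code posted in this thread.
class ArgsSketch {

    // Immutable holder for the parsed command-line settings.
    static final class Settings {
        final URI start;
        final int maxDepth;
        final int debugLevel;

        Settings(URI start, int maxDepth, int debugLevel) {
            this.start = start;
            this.maxDepth = maxDepth;
            this.debugLevel = debugLevel;
        }
    }

    static Settings parse(String[] args) {
        // Validate first: nothing below touches args[0] unless it exists.
        if (args.length < 1) {
            throw new IllegalArgumentException("Missing parameter - start URL");
        }
        // URI.create throws IllegalArgumentException on malformed input.
        URI start = URI.create(args[0]);
        // Optional arguments fall back to the assumed defaults.
        int maxDepth = args.length > 1 ? Integer.parseInt(args[1]) : 5;
        int debugLevel = args.length > 2 ? Integer.parseInt(args[2]) : 0;
        if (maxDepth < 0 || maxDepth > 5) {
            throw new IllegalArgumentException("maxDepth must be between 0 and 5");
        }
        return new Settings(start, maxDepth, debugLevel);
    }

    public static void main(String[] args) {
        try {
            Settings s = parse(args);
            System.err.println("Start: " + s.start
                    + ", maxDepth=" + s.maxDepth + ", debugLevel=" + s.debugLevel);
        } catch (IllegalArgumentException e) {
            // NumberFormatException is a subclass, so bad numbers land here too.
            System.err.println(e.getMessage());
        }
    }
}
```

    The parsed values could then be copied into the callback's maxDepth and debugLevel fields before the crawl loop starts.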

  9. #9
    Dodo is offline Member

    Hi Eranga,

    I know I should fix the validation in the code below first, but it isn't important right now; I will do it later.

    Java Code:
                uri = new URI(args[0] + "/");
                foundURIs.add(new URIinfo(uri, 0));
                visitedURIs.add(uri);
    
                if (args.length < 1) {
                    System.err.println("Missing parameter - start URL");
                    return;
                }

    There is a different problem that I need to solve: I have to implement the body of the method "public void handleText(char[] data, int pos)". In my opinion, it should use an instance of "HashMap<String, Integer>", as follows.

    Java Code:
    class ParserCallback extends HTMLEditorKit.ParserCallback {
    	
    	...
    	
    	HashMap<String,Integer> foundWords;
    	
    	...
    	
    	public void handleText(char[] data, int pos) {
    		 System.out.println("handleText: "+String.valueOf(data)+", pos = "+pos);
    		 foundWords.put(String.valueOf(data), 1);   // is it correct?
    	}
    }

    The next step should be to use this HashMap in the Parser class: sort the map's values according to the keys, select the 20 most frequent words and print them on the screen. I don't know how to do that. Do you have any idea?
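
    Ranking by frequency means ordering the map's entries by their values (the counts) rather than by their keys. A minimal sketch, assuming the counts are already collected in a Map<String, Integer> with no null values; the class TopWordsSketch and the method topN are hypothetical names, and breaking ties alphabetically is an arbitrary choice made here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the code posted in this thread.
class TopWordsSketch {

    // Returns up to n entries, highest count first; equal counts are
    // ordered alphabetically by word so the result is deterministic.
    static List<Map.Entry<String, Integer>> topN(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return new ArrayList<>(entries.subList(0, Math.min(n, entries.size())));
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("java", 3);
        counts.put("the", 5);
        counts.put("url", 1);
        for (Map.Entry<String, Integer> e : topN(counts, 20)) {
            System.out.println(e.getKey() + ": " + e.getValue());
        }
    }
}
```

    A HashMap has no defined iteration order, so the entries are copied into a list and sorted there instead of trying to sort the map itself.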

    Have a nice day.

    Michal
    Last edited by Dodo; 11-08-2009 at 09:38 PM.

  10. #10
    Eranga is offline Moderator

    Okay, if you don't want to make those minor fixes in your code now, I don't mind. :) But it is best to take care of such things while you are coding. That's just what I noticed when I went through the code.

    About your original question: did you read about the handleText method in the Java API?

    Java Code:
    public void handleText(char[] data, int pos) {
    		 System.out.println("handleText: "+String.valueOf(data)+", pos = "+pos);
    		 foundWords.put(String.valueOf(data), 1);   // is it correct?
    	}
    What does the char array contain here? A run of characters from the HTML page, collected from a specific location whose position is given by the variable pos. So to store words in whatever collection class you choose, you first have to extract the words from that text. How can you find them here? That's the logical problem you have to solve first. Do you get my point?
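
    One possible way to get from the char array to individual words, as a sketch under assumptions: the class WordTallySketch and its methods are hypothetical, a "word" is taken to be any run of letters, and words are lower-cased so that "The" and "the" count together. Note that put(word, 1) would overwrite an existing count with 1, so the sketch uses Map.merge to increment instead.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of the code posted in this thread.
class WordTallySketch {

    // The map is created up front; an uninitialized field would cause a
    // NullPointerException on the first call.
    private final Map<String, Integer> foundWords = new HashMap<>();

    // Splits one chunk of page text into words and adds them to the tally.
    void addText(char[] data) {
        String text = new String(data).toLowerCase(Locale.ENGLISH);
        // Split on runs of non-letters; this also splits "don't" into
        // "don" and "t", which is a simplification of this sketch.
        for (String word : text.split("\\P{L}+")) {
            if (!word.isEmpty()) {
                foundWords.merge(word, 1, Integer::sum);
            }
        }
    }

    Map<String, Integer> counts() {
        return foundWords;
    }

    public static void main(String[] args) {
        WordTallySketch tally = new WordTallySketch();
        tally.addText("The cat saw the other cat.".toCharArray());
        System.out.println(tally.counts());
    }
}
```

    Inside the callback, handleText could simply forward its data argument to addText, since the parser calls it once for each chunk of text it finds.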

  11. #11
    Dodo is offline Member

    Hi Eranga,

    as I mentioned last time, I'm very busy these days. I don't have time to study Java programming thoroughly, but I plan to do so in the near future :)

    I haven't looked up the "handleText" method in the Java API yet; I'll do it :)

    Now I have to complete a different task I got at university, again a program in Java. I'm going to try it myself first; if I have a problem with it, I'll post again here.

    Bye, Michal

  12. #12
    Eranga is offline Moderator

    Yes, you are welcome to post new questions any time, after putting in some effort first, lol. You may be busy with your duties, same as us, so we want to help as best we can. Personally, I want to see you learning something new all the time.
