I am developing a web crawler in Java. So far I have a program that parses all the hyperlinks from an entered URL, visits each link one by one, and repeats the process. Now I want to extract all the visible text from a particular web page, and I am having trouble with this. Can anyone suggest how to accomplish it? Any help will be greatly appreciated.
Thanks in advance
I suppose it depends on what you want to do with the text, but you can look for the tags that typically contain a page's text:
<p>...</p>, <table>...</table>, and so on. This might be a good application of regular expressions.
What is the overall goal? To store the parsed text into tables/data structures in some ordered way?
Dear Desh Banks,
Thanks a lot for your concern, but I have already solved that problem. If you want to help, please help me parse all the visible text from a particular web page after its source code has been retrieved.
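For what it is worth, the regular-expression idea suggested earlier in the thread could be sketched roughly as below. This is only a minimal illustration and not code from the original crawler: the class name VisibleTextSketch and the method extract are hypothetical, it assumes the page source is already held in a String, and it only handles a handful of common entities. Regular expressions are fragile on malformed real-world HTML, so a proper HTML parser is likely to be more robust.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the original crawler.
final class VisibleTextSketch {
    // (?is): case-insensitive, and '.' also matches line breaks.
    // Drops elements whose content is never rendered as page text.
    private static final Pattern HIDDEN_BLOCKS =
        Pattern.compile("(?is)<(script|style|head)\\b[^>]*>.*?</\\1\\s*>");
    private static final Pattern COMMENTS = Pattern.compile("(?s)<!--.*?-->");
    private static final Pattern TAGS = Pattern.compile("<[^>]+>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String extract(String html) {
        String text = COMMENTS.matcher(html).replaceAll(" ");
        text = HIDDEN_BLOCKS.matcher(text).replaceAll(" ");
        // Replace every remaining tag with a space so adjacent words stay apart.
        text = TAGS.matcher(text).replaceAll(" ");
        // Decode a few common entities; &amp; goes last to avoid double decoding.
        text = text.replace("&nbsp;", " ")
                   .replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&quot;", "\"")
                   .replace("&amp;", "&");
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Demo</title><style>p{color:red}</style></head>"
            + "<body><p>Hello &amp; welcome</p><script>var x = 1;</script>"
            + "<table><tr><td>Cell</td></tr></table></body></html>";
        System.out.println(extract(html)); // prints "Hello & welcome Cell"
    }
}
```

Note that this sketch discards the whole head element, so the page title is not included in the result; whether that is desirable depends on what the extracted text is used for.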