- 11-30-2011, 11:18 PM #1
Member
- Join Date
- Mar 2011
- Posts
- 11
- Rep Power
- 0
Parsing a web page: imitating a browser as close as possible
Hi guys!
I'm creating a piece of software that crawls through several pages and collects ALL the links on each page.
You may say "yeah, that's easy... you can use HtmlParser, JerichoParser, etc.", but the problem is that those are decent parsers for HTML links but VERY SLOW parsers for links in JavaScript and other technologies.
That's the reason for my title. I need to crawl through ALL of the links on a page, but at a decent speed... JUST as a browser does.
My questions are:
1. Recommendations on a parser to do this job?
2. Which kinds of links and technologies will I encounter (besides HTML and JS)? I need a parser that handles all of them in an efficient way.
Thanks in advance!
-Lucas
- 12-06-2011, 02:05 AM #2
Can anyone lend me a hand with this one? :S
- 12-14-2011, 02:29 AM #3
Does anyone know WHERE I can look for help regarding this issue?
Thanks in advance.
- 12-14-2011, 02:52 PM #4
Moderator
- Join Date
- Apr 2009
- Posts
- 10,481
- Rep Power
- 16
How will you identify links in JavaScript?
- 12-20-2011, 02:48 AM #5
Hi Tolls,
Right now, I'm using JerichoParser to do that job... but it's too slow.
With Jericho I basically open the src attribute of every <script type="text/javascript"> tag.
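For readers following along, the idea described above (collecting the src attribute of every script tag) can be sketched without any parser library. This is only a rough illustration using java.util.regex, not the Jericho API and not the code used in this thread; the class name ScriptSrcExtractor and the sample HTML are made up for the example, and a regex will miss unquoted attributes and other edge cases a real HTML parser handles.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Jericho or HtmlParser.
final class ScriptSrcExtractor {

    // Matches an opening <script ...> tag and captures its attribute text.
    private static final Pattern SCRIPT_TAG =
            Pattern.compile("<script\\b([^>]*)>", Pattern.CASE_INSENSITIVE);

    // Matches src="..." or src='...' but not attributes such as data-src.
    private static final Pattern SRC_ATTR =
            Pattern.compile("(?<![\\w-])src\\s*=\\s*([\"'])(.*?)\\1",
                    Pattern.CASE_INSENSITIVE);

    // Returns the src value of every script tag that has one, in document order.
    static List<String> extract(String html) {
        List<String> sources = new ArrayList<>();
        Matcher tag = SCRIPT_TAG.matcher(html);
        while (tag.find()) {
            Matcher src = SRC_ATTR.matcher(tag.group(1));
            if (src.find()) {
                sources.add(src.group(2));
            }
        }
        return sources;
    }

    public static void main(String[] args) {
        // Sample page: one external script and one inline script without src.
        String html = "<html><head>"
                + "<script type=\"text/javascript\" src=\"/js/app.js\"></script>"
                + "<script type=\"text/javascript\">var x = 1;</script>"
                + "</head></html>";
        System.out.println(extract(html));
    }
}
```

Note that this only tells you which script files exist. It says nothing about the URLs those scripts request once they run.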
- 12-20-2011, 09:51 AM #6
What about ones that are built up?
Or held as constants in chunks?
Just trying to figure out how far you are expecting to go with this.
- 12-20-2011, 02:38 PM #7
Can you give an example? I'm not sure I get your point.
- 12-20-2011, 04:05 PM #8
var SOME_TARGET = "something.action";
later on (in some other js file):
callAjax(SOME_TARGET + "<a load of parameters>");
This is possibly why Jericho takes so long.
Your browser only has to evaluate that at the point it's called, but a web page might have a ton of code that does this sort of thing, which Jericho then has to figure out.
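To make the SOME_TARGET example concrete, here is a minimal Java sketch of what a static tool would have to do to recover that one URL: collect string constants from every script first, then substitute them into the call. It is an assumption-laden illustration, not how Jericho works internally. The class name ConstantFoldingSketch is invented, "?id=42" is a hypothetical stand-in for the "<a load of parameters>" placeholder, and only this exact `var NAME = "literal";` plus `callAjax(NAME + "literal")` shape is recognised.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: resolves one narrow pattern of "built up" URLs.
final class ConstantFoldingSketch {

    // Recognises: var NAME = "literal";
    private static final Pattern CONST_DECL =
            Pattern.compile("var\\s+(\\w+)\\s*=\\s*\"([^\"]*)\"\\s*;");

    // Recognises: callAjax(NAME + "literal")
    private static final Pattern AJAX_CALL =
            Pattern.compile("callAjax\\(\\s*(\\w+)\\s*\\+\\s*\"([^\"]*)\"\\s*\\)");

    static List<String> resolveTargets(List<String> scripts) {
        // Pass 1: gather constants from ALL scripts, because the declaration
        // may live in a different file than the call that uses it.
        Map<String, String> constants = new HashMap<>();
        for (String js : scripts) {
            Matcher decl = CONST_DECL.matcher(js);
            while (decl.find()) {
                constants.put(decl.group(1), decl.group(2));
            }
        }
        // Pass 2: substitute known constants into each callAjax(...) argument.
        List<String> targets = new ArrayList<>();
        for (String js : scripts) {
            Matcher call = AJAX_CALL.matcher(js);
            while (call.find()) {
                String base = constants.get(call.group(1));
                if (base != null) {
                    targets.add(base + call.group(2));
                }
            }
        }
        return targets;
    }

    public static void main(String[] args) {
        // Two "files": the constant is declared in one and used in the other.
        List<String> scripts = Arrays.asList(
                "var SOME_TARGET = \"something.action\";",
                "callAjax(SOME_TARGET + \"?id=42\");");
        System.out.println(resolveTargets(scripts));
    }
}
```

Even this toy case needs two passes over every script. Anything beyond it (URLs built in loops, from function results, or from user input) cannot be recovered by pattern matching like this.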
- 12-20-2011, 04:25 PM #9
- Join Date
- Sep 2008
- Location
- Voorschoten, the Netherlands
- Posts
- 11,406
- Blog Entries
- 7
- Rep Power
- 17
I think it's an undecidable problem in general, because you actually have to execute/interpret the JavaScript code in order to find out what the actual links are. It is equivalent to the halting problem, and you don't want to go there.
kind regards,
Jos