  1. #1
    LucasH (Member)

    Parsing a web page: imitating a browser as close as possible

    Hi guys!
    I'm creating a piece of software that basically crawls through several pages, getting ALL the links on each page.
    You may say "yeah, that's easy... you can use HtmlParser, JerichoParser, etc.", but the problem is that those are decent parsers for HTML links but VERY SLOW parsers for links in JavaScript and other technologies.
    That's the reason for my title. I need to crawl through ALL of the links on a page, but at a decent speed... JUST as a browser does.

    My questions are:
    1. Recommendations on a parser to do this job?
    2. What variety of links and technologies will I encounter (besides HTML and JS)? I need a parser that handles all of them in an efficient way.


    Thanks in advance!
    -Lucas

  2. #2
    LucasH (Member)

    can anyone lend me a hand on this one? :S

  3. #3
    LucasH (Member)

    Does anyone know WHERE I can look for help regarding this issue?
    Thanks in advance

  4. #4
    Tolls (Moderator)

    How will you identify links in JavaScript?

  5. #5
    LucasH (Member)

    Hi Tolls
    Right now, I'm using JerichoParser to do that job... but it's too slow.
    With Jericho, I basically open the URL in the src attribute of every <script type="text/javascript"> tag.
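
As a rough illustration of the step described above (collecting the src attribute of each script tag), here is a minimal sketch that uses only the JDK. It is not Jericho's API and not the poster's actual code: the class name ScriptSrcExtractor, the method scriptSources, and the sample HTML are all assumptions made for the example, and a regex stands in for a real HTML parser, so it will miss unquoted attributes and can be fooled by attributes such as data-src.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
public class ScriptSrcExtractor {

    // Matches an opening <script ...> tag and captures the quoted src value.
    // Group 1 is the quote character, group 2 is the attribute value.
    private static final Pattern SCRIPT_SRC = Pattern.compile(
            "<script\\b[^>]*?\\bsrc\\s*=\\s*([\"'])(.*?)\\1[^>]*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the src value of every script tag that has one, in document order.
    public static List<String> scriptSources(String html) {
        List<String> sources = new ArrayList<>();
        Matcher m = SCRIPT_SRC.matcher(html);
        while (m.find()) {
            sources.add(m.group(2));
        }
        return sources;
    }

    public static void main(String[] args) {
        // Made-up sample page: two external scripts and one inline script.
        String html = "<html><head>"
                + "<script type=\"text/javascript\" src=\"/js/app.js\"></script>"
                + "<script type='text/javascript' src='http://example.com/lib.js'></script>"
                + "<script type=\"text/javascript\">var inline = 1;</script>"
                + "</head></html>";
        System.out.println(scriptSources(html));
    }
}
```

Note that this only tells you which script files exist. Fetching each of them and looking for links inside is a separate (and much slower) step.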

  6. #6
    Tolls (Moderator)

    What about ones that are built up?
    Or held as constants in chunks?

    Just trying to figure out how far you are expecting to go with this.

  7. #7
    LucasH (Member)

    Can you give an example? I'm not sure I get your point.

  8. #8
    Tolls (Moderator)

    var SOME_TARGET = "something.action";

    later on (in some other js file):
    callAjax(SOME_TARGET + "<a load of parameters>");

    This is possibly why Jericho takes so long.
    Your browser only has to evaluate that at the point it's called, but a web page might have a ton of code that does this sort of thing, all of which Jericho then has to figure out.
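
To make the point above concrete, here is a small sketch of why a purely static scan struggles with built-up URLs. It pulls the double-quoted string literals out of two JavaScript snippets modelled on the example in this post. The class name StaticScanLimit, the method stringLiterals, and the "?id=42" parameter string are assumptions for illustration (the original only says "a load of parameters"), and the regex is a simplification that ignores single-quoted strings and comments.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
public class StaticScanLimit {

    // Matches a double-quoted JavaScript string literal, allowing backslash escapes.
    private static final Pattern STRING_LITERAL =
            Pattern.compile("\"((?:[^\"\\\\]|\\\\.)*)\"");

    // Returns the contents of every double-quoted literal found in the source text.
    public static List<String> stringLiterals(String js) {
        List<String> literals = new ArrayList<>();
        Matcher m = STRING_LITERAL.matcher(js);
        while (m.find()) {
            literals.add(m.group(1));
        }
        return literals;
    }

    public static void main(String[] args) {
        // Two separate "files", as in the example above.
        String constantsFile = "var SOME_TARGET = \"something.action\";";
        String callerFile = "callAjax(SOME_TARGET + \"?id=42\");"; // parameters are made up

        // Each scan sees only a fragment of the URL.
        System.out.println(stringLiterals(constantsFile));
        System.out.println(stringLiterals(callerFile));

        // The URL actually requested at runtime never appears as a literal anywhere;
        // it only exists once the concatenation has been evaluated.
        String runtimeUrl = "something.action" + "?id=42";
        System.out.println(runtimeUrl);
    }
}
```

A scanner would have to track the value of SOME_TARGET across files and evaluate the concatenation to recover the full URL, which is what the browser does for free when the call actually runs.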

  9. #9
    JosAH (Moderator)

    I think it's an undecidable problem, because you actually have to execute/interpret the JavaScript code in order to find out what the actual links are. In the general case it is equivalent to the halting problem, and you don't want to go there.

    kind regards,

    Jos

  10. #10
    DarrylBurke (Forum Police)

    Spammy link removed. ValeryB, there's a section of the forums for advertising. This isn't it.

    db
