  1. #1
    dinosoep is offline Senior Member
    Join Date
    Nov 2009
    Posts
    150
    Rep Power
    5

    html, biggest nightmare ever

    Here it goes:
    I am working on a major project and am currently parsing XML files.
    Some tags in the XML files contain HTML.
    Not normal HTML but BAD HTML, the worst possible HTML ever written.
    I want to extract the image tags.
    I know it's impossible to save all the image tags because some are just terribly written, but I want to save 99% of them.

    How would you do it?
    Save my day :)

  2. #2
    Tolls is online now Moderator
    Join Date
    Apr 2009
    Posts
    12,091
    Rep Power
    20

    Is the file valid XML?
    Or by "bad" do you mean that it is not valid, i.e. not parseable?

  3. #3
    dinosoep is offline Senior Member

    The XML file is valid; it's something like this:
    <files>
    <file>
    <name>file1</name>
    <contents>BAD HTML</contents>
    </file>
    <file>
    <name>file2</name>
    <contents>BAD HTML</contents>
    </file>
    </files>

    Now I have to extract the image tags from that HTML.

  4. #4
    Tolls is online now Moderator

    Is the HTML valid?
    That is, could you extract the CDATA (I presume it's that) from the contents tags and parse those successfully?
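For what it's worth, getting the raw HTML out of the <contents> elements could look roughly like the sketch below. It assumes the <files>/<file>/<name>/<contents> layout shown in post #3 and that the XML itself is well-formed; the class and method names (ContentsExtractor, extractContents) are made up for illustration. It only uses the DOM parser that ships with the JDK, and getTextContent() returns the text whether it was wrapped in CDATA or escaped with entities.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: maps each <name> to the raw text of its <contents>.
class ContentsExtractor {

    static Map<String, String> extractContents(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));

            // LinkedHashMap keeps the files in document order.
            Map<String, String> result = new LinkedHashMap<>();
            NodeList files = doc.getElementsByTagName("file");
            for (int i = 0; i < files.getLength(); i++) {
                Element file = (Element) files.item(i);
                // Assumes every <file> has exactly one <name> and one <contents>.
                String name = file.getElementsByTagName("name").item(0).getTextContent();
                String html = file.getElementsByTagName("contents").item(0).getTextContent();
                result.put(name, html);
            }
            return result;
        } catch (Exception e) {
            // Malformed XML or parser setup problems end up here.
            throw new IllegalArgumentException("Could not parse the XML", e);
        }
    }
}
```

Calling ContentsExtractor.extractContents(xmlString) gives you one String of (possibly broken) HTML per file, which you can then hand to whatever does the image extraction.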

  5. #5
    dinosoep is offline Senior Member

    Which parser do you suggest?
    Sometimes end tags are missing, and sometimes you get something like <img src =
    "http...
    if you understand what I mean.
    Does the parser you have in mind support bad HTML?
    What's the best one out there?

  6. #6
    Tolls is online now Moderator

    So the HTML isn't valid, then.
    In which case all you can do is extract it manually, I suppose, unless someone else has a better idea. I've not had to deal with bad XML, other than to reject it.
    So you'd be manually trawling through the String in <contents>, looking for <img as a start point, and then reading until you have all the data you need, possibly?
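A rough sketch of that manual scan, using only java.util.regex, might look like this. It is one possible approach, not a full HTML parser: it looks for "<img" (any case), reads up to the next '>' or '<' or the end of the input so that an unterminated tag does not swallow the rest of the document, and then looks for a src attribute with optional whitespace or line breaks around the '='. The names ImgScanner and extractImgSources are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical lenient scanner for <img> tags in broken HTML.
class ImgScanner {

    // "<img", then everything up to the next '>' or '<' (or end of input).
    // The negated character class also matches line breaks.
    private static final Pattern IMG =
            Pattern.compile("<img\\b[^<>]*>?", Pattern.CASE_INSENSITIVE);

    // src = "value", src = 'value' or src = value; \s also covers newlines.
    private static final Pattern SRC = Pattern.compile(
            "\\bsrc\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s\"'>]+))",
            Pattern.CASE_INSENSITIVE);

    static List<String> extractImgSources(String html) {
        List<String> sources = new ArrayList<>();
        Matcher img = IMG.matcher(html);
        while (img.find()) {
            Matcher src = SRC.matcher(img.group());
            if (!src.find()) {
                continue; // <img> without a usable src: skip it
            }
            // Exactly one of the three capture groups is non-null.
            for (int g = 1; g <= 3; g++) {
                if (src.group(g) != null) {
                    sources.add(src.group(g).trim());
                    break;
                }
            }
        }
        return sources;
    }
}
```

Known limits of this sketch: a '>' or '<' inside an attribute value cuts the tag short, and a src whose opening quote is never closed is skipped rather than guessed at. If those cases matter for your data, a lenient HTML parser library is probably the better route.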

  7. #7
    dinosoep is offline Senior Member

    That's what I tried, with no success.
    Anyway, thanks for the help.
    How do I delete this topic?

  8. #8
    Tolls is online now Moderator

    "no success"?
    That's a bit defeatist...unless the HTML is so bad it wouldn't actually display, I can't see why you can't extract <img> tags from it.

    And why would you want to delete the topic?

  9. #9
    dinosoep is offline Senior Member

    By "no success" I mean that writing my own scanner didn't work out.
    I started searching for libraries and finally found one that fits my needs, problem closed :)
    If anyone is interested: HtmlCleaner
