  1. #1
    mohammedsk is offline Member
    Join Date
    Jun 2010
    Posts
    4
    Rep Power
    0

    Match a word between any two HTML tags

    Hi,
    I am new to Java regular expressions. I have been reading a lot about them, but they still puzzle me.

    Here is my problem:
    I am given some HTML code and a keyword to search for. The keyword has to appear between two arbitrary HTML tags, and I would like to know whether there are any matches.

    Example:
    html = "<html><head><title>Java reg exp</title></head>
    <body>
    <h1>Short Bio</h1>
    </body></html>"

    another example:
    html = "<html><head><title>Java reg exp</title></head>
    <body>
    <a href="link.html">Short Bio</a>
    </body></html>"

    My regular expression attempt is: <[^>]+>.*?Bio.*?</[^>]+>
    Of course it does not work because it will first match the <html> tag.
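    To see the failure concretely, here is a minimal sketch that runs the attempt above against the first example. It assumes the HTML is on a single line (with the line breaks kept, `.` would not cross them and the result would differ); the class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstAttemptDemo {
    // The pattern from the post, unchanged.
    static final Pattern FIRST_ATTEMPT =
        Pattern.compile("<[^>]+>.*?Bio.*?</[^>]+>");

    // Returns the first matched substring, or null when nothing matches.
    static String firstMatch(String html) {
        Matcher m = FIRST_ATTEMPT.matcher(html);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Assumption: the example HTML collapsed onto one line.
        String html = "<html><head><title>Java reg exp</title></head>"
                    + "<body><h1>Short Bio</h1></body></html>";
        // The match starts at <html>, not at <h1>, because the engine
        // tries the leftmost starting position first.
        System.out.println(firstMatch(html));
    }
}
```

    The match is found, but it spans everything from `<html>` up to `</h1>`, which is not the pair of tags that actually encloses the keyword.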

    My goal is to find out whether the word "Bio" exists in the HTML code and is between two HTML tags. The HTML tags are arbitrary; I do not have a list of tags to match.

    I have searched a lot. Some people online suggested using an HTML parser, but those HTML parsers do not have the capability to search the text.

    Any suggestions are appreciated

  2. #2
    Zack is offline Senior Member
    Join Date
    Jun 2010
    Location
    Destiny Islands
    Posts
    692
    Rep Power
    5

    This might help you:
    Regular Expression HOWTO
    (It is a Python tutorial I think, but it's about RegEx in general.)

    Basically, what is going on is that regex quantifiers are greedy by default; they want to match as much text as possible. By using non-greedy (reluctant) quantifiers such as .*?, you can tell the engine to match as little text as possible, which should then find the nearest <> tag before Bio instead of the furthest.
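    A small sketch of the greedy versus reluctant difference in Java. The input string and class name are invented for the demo. One caveat worth keeping in mind: a reluctant quantifier only changes how far a match extends to the right; the engine still starts from the leftmost position where a match is possible.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsReluctant {
    // Returns the first substring matched by regex, or null if none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "<b>one</b><b>two</b>";
        // Greedy: .* runs to the last </b> it can reach.
        System.out.println(firstMatch("<b>.*</b>", input));   // <b>one</b><b>two</b>
        // Reluctant: .*? stops at the first </b>.
        System.out.println(firstMatch("<b>.*?</b>", input));  // <b>one</b>
    }
}
```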

  3. #3
    curmudgeon is offline Senior Member
    Join Date
    May 2010
    Posts
    436
    Rep Power
    5

    You may need to use groups so that the text in the opening and closing tags match.
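    One way to read that suggestion is to capture the opening tag name and require the same name in the closing tag with a backreference. The sketch below assumes simple tag names (`\w+`) with optional attributes and no nested tags around the keyword; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchingTagPair {
    // Group 1 captures the tag name; \1 requires the same name in the closing tag.
    static final Pattern BIO_IN_PAIR =
        Pattern.compile("(?i)<(\\w+)[^>]*>[^<]*bio[^<]*</\\1>");

    // Returns the name of the enclosing tag, or null when there is no match.
    static String enclosingTag(String html) {
        Matcher m = BIO_IN_PAIR.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(enclosingTag("<a href=\"link.html\">Short Bio</a>")); // a
        System.out.println(enclosingTag("<h1>Short Bio</h2>"));                  // null
    }
}
```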

  4. #4
    mohammedsk is offline Member
    Join Date
    Jun 2010
    Posts
    4
    Rep Power
    0

    Thanks for the responses

    Is it (.*?) around the word "Bio" that has to be replaced with something else to make it non-greedy? At least that is my guess

    I am getting there, here are the new changes:
    (?i)<([^>]+)>[^<]*Bio.*</([^>]+)>

    It matched the beginning, but I still need to work on the ending tag.
    I do not understand why it stops matching when I place [^>]* after "Bio".
    Last edited by mohammedsk; 06-19-2010 at 04:13 AM.

  5. #5
    mohammedsk is offline Member
    Join Date
    Jun 2010
    Posts
    4
    Rep Power
    0

    wowow, I got it, thank you a lot guys. thanks for making me think.

    Here is the final solution:
    (?i)<([^>]+)>([^<]*)bio([^<]*)</([^<]+)>
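    For anyone wanting to try this from Java, here is a minimal sketch that wraps the pattern above in a helper. The helper names are made up, and `Pattern.quote` is added on the assumption that the keyword may contain regex metacharacters. Because `[^<]*` cannot cross a `<`, the keyword has to sit in the text directly after a tag, so a keyword that appears only inside an attribute is not reported.

```java
import java.util.regex.Pattern;

class KeywordBetweenTags {
    // Builds the thread's final pattern around an arbitrary keyword.
    static Pattern forKeyword(String keyword) {
        return Pattern.compile(
            "(?i)<([^>]+)>([^<]*)" + Pattern.quote(keyword) + "([^<]*)</([^<]+)>");
    }

    // True when the keyword appears in text directly enclosed by two tags.
    static boolean containsBetweenTags(String html, String keyword) {
        return forKeyword(keyword).matcher(html).find();
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Java reg exp</title></head>\n"
                    + "<body>\n"
                    + "<a href=\"link.html\">Short Bio</a>\n"
                    + "</body></html>";
        System.out.println(containsBetweenTags(html, "Bio"));  // true
        System.out.println(containsBetweenTags(html, "link")); // false: attribute only
    }
}
```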

  6. #6
    mohammedsk is offline Member
    Join Date
    Jun 2010
    Posts
    4
    Rep Power
    0

    One last question: I am trying to add a word boundary around "Bio", but it does not seem to work. Is this the right syntax?

    (?i)<([^>]+)>([^<]*)\bbio\b([^<]*)</([^<]+)>
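    The post does not show how the pattern was written in Java source, so this is only a guess at the cause: inside a Java string literal, `\b` is the backspace character, and the regex word boundary has to be written as `\\b`. A minimal sketch under that assumption (class and method names are hypothetical):

```java
import java.util.regex.Pattern;

class WordBoundaryDemo {
    // Note the doubled backslashes: the regex engine must receive \b, not a backspace.
    static final Pattern WHOLE_WORD = Pattern.compile(
        "(?i)<([^>]+)>([^<]*)\\bbio\\b([^<]*)</([^<]+)>");

    static boolean hasWholeWord(String html) {
        return WHOLE_WORD.matcher(html).find();
    }

    public static void main(String[] args) {
        System.out.println(hasWholeWord("<h1>Short Bio</h1>"));   // true
        System.out.println(hasWholeWord("<h1>Biology</h1>"));     // false
        System.out.println(hasWholeWord("<h1>Bio's page</h1>"));  // true
        // In a string literal, "\b" is a single backspace character (code 8).
        System.out.println((int) "\b".charAt(0));                 // 8
    }
}
```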

  7. #7
    Zack is offline Senior Member
    Join Date
    Jun 2010
    Location
    Destiny Islands
    Posts
    692
    Rep Power
    5

    You mean so it matches only whole words? I think you'd have to wrap it in spaces. Perhaps something like:

    (?i)<([^>]+)>(([^<]*) )?bio( ([^<]*))?</([^<]+)>

    That's untested, but it should work...

  8. #8
    Lil_Aziz1 is offline Senior Member
    Join Date
    Dec 2009
    Location
    United States
    Posts
    343
    Rep Power
    5

    hmm well it could be ...bob</..> or bob's</...>

    The \b should work. Did you try \b(bob)\b?
    If not, just make your end word boundary using pipes:

    (\s|>)bob([^A-Za-z]|\s|<)
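    A variation on the hand-made boundary, sketched with lookarounds so that the neighbouring characters are checked but not consumed. It uses "bio" from the original question rather than "bob", and treats only ASCII letters as word characters, which is an assumption; the names are invented for the example.

```java
import java.util.regex.Pattern;

class HandMadeBoundary {
    // Not preceded by a letter and not followed by a letter.
    static final Pattern BIO_WORD =
        Pattern.compile("(?i)(?<![A-Za-z])bio(?![A-Za-z])");

    static boolean containsWord(String text) {
        return BIO_WORD.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsWord("<h1>Short Bio</h1>")); // true
        System.out.println(containsWord("my bio's here"));      // true
        System.out.println(containsWord("symbiosis"));          // false
    }
}
```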
    "Experience is what you get when you don't get what you want" (Dan Stanford)
    "Rise and rise again until lambs become lions" (Robin Hood)
