  1. #1
    Norm's Avatar
    Norm is offline Moderator
    Join Date
    Jun 2008
    Location
    SW Missouri
    Posts
    17,305
    Rep Power
    25

    regex problem - allowing optional space

    Hi,
    I'm trying to get an expression to match a string with an optional space in the second position. I've tried the following patterns with no luck. The pattern should not match "<bP>", where b is an optional space. I've tried \\s for whitespace and [ ] with a space between the brackets, with no luck.
    System.out.println("< P> <B>and <P>".replaceAll("< *?[^P].*?>", "")); // and <P>
    System.out.println("<P> <B>and <P>".replaceAll("< *?[^P].*?>", "")); // <P> and <P>
    The string in the first has a space between the < and the P.
    Desired result for first string: < P> and <P>

    Thanks,
    Norm
    Last edited by Norm; 09-27-2008 at 08:01 PM. Reason: Corrected example by including <B>

  2. #2
    Nicholas Jordan's Avatar
    Nicholas Jordan is offline Senior Member
    Join Date
    Jun 2008
    Location
    Southwest
    Posts
    1,018
    Rep Power
    8


    replaceAll("< ?

    deceptive, simple, works

    unlike the hour down load for some other stuff I have running right now.

    which is simple, deceptive, doesn't work....
    Introduction to Programming Using Java.
    Cybercartography: A new theoretical construct proposed by D.R. Fraser Taylor

  3. #3
    Norm's Avatar
    Norm is offline Moderator
    Join Date
    Jun 2008
    Location
    SW Missouri
    Posts
    17,305
    Rep Power
    25


    Are you suggesting a solution for my problem?
    Can you post a working example?

  4. #4
    DarrylBurke's Avatar
    DarrylBurke is offline Member
    Join Date
    Sep 2008
    Location
    Madgaon, Goa, India
    Posts
    11,188
    Rep Power
    19


    Maybe I haven't understood the requirement perfectly, but for the two inputs and expected outputs you have given, this is adequate:
    Java Code:
    [CODE]
    public class MatchNotSpaceNotP {
    
       public static void main(String[] args) {
          String[] inputs = new String[2];
          inputs[0] = "< P> <B>and <P>";
          inputs[1] = "<P> <B>and <P>";
    
          String regex = "<[^P]+>";
          for (String string : inputs) {
             System.out.println(string.replaceAll(regex, ""));
          }
       }
    }
    [/CODE]

  5. #5
    Norm's Avatar
    Norm is offline Moderator
    Join Date
    Jun 2008
    Location
    SW Missouri
    Posts
    17,305
    Rep Power
    25


    Darryl,
    Thanks for that.
    I only gave you part of the specs, which include more than skipping over optional spaces after the <.

    So let me try again. Here are three strings. One input and two desired outputs, each requiring a different pattern.
    String input = "< P>This is<B>a</B> sample.<H1>Header</H1><P>new para<UL><LI>first item</LI></P>";
    // Remove <B>, </B> and <UL>
    String Desired1 = "< P>This isa sample.<H1>Header</H1><P>new para<LI>first item</LI></P>";
    // Remove <B>, </B>, </H1>, </LI> and <UL>
    String Desired2 = "< P>This isa sample.<H1>Header<P>new para<LI>first item";
    Regexes are sort of magic to me. Your pattern doesn't seem to account for optional spaces. I had tried \\s{0,}, with no luck.
    I also tried "<[^(P|H1|LI)].*?>", but this one failed with the space between the < and the P.
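
    A minimal sketch of why that last pattern misbehaves, offered as an illustration rather than a fix: inside square brackets, the characters ( | ) are literals, so [^(P|H1|LI)] means "any single character other than (, P, |, H, 1, L, I or )". A space qualifies, so < P> is matched and removed. The class name CharClassDemo and the helper strip are made up for this example.

```java
class CharClassDemo {

    // Hypothetical helper: applies the character-class pattern quoted above.
    static String strip(String s) {
        return s.replaceAll("<[^(P|H1|LI)].*?>", "");
    }

    public static void main(String[] args) {
        // 'P' is excluded by the class, so the tag survives.
        System.out.println(strip("<P>kept"));              // prints "<P>kept"
        // The space is NOT excluded, so the class matches it and the tag goes.
        System.out.println(strip("< P>removed"));          // prints "removed"
        // 'H' alone is excluded too, so any tag starting with H survives.
        System.out.println(strip("<HR>kept by accident")); // prints "<HR>kept by accident"
    }
}
```

    The third line shows the other side of the same problem: the class excludes single letters, not whole tag names.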

  6. #6
    Fubarable's Avatar
    Fubarable is offline Moderator
    Join Date
    Jun 2008
    Posts
    19,316
    Blog Entries
    1
    Rep Power
    26


    If you don't get a decent answer here, I suggest you ask on the forums.sun.com site. They have some wonderful regex mavens there, such as uncleAlice. Of course, if you do this, link the two cross-posted threads to each other...

    Good luck.

  7. #7
    DarrylBurke's Avatar
    DarrylBurke is offline Member
    Join Date
    Sep 2008
    Location
    Madgaon, Goa, India
    Posts
    11,188
    Rep Power
    19


    Hey Pete gimme a chance ;) Although I do agree that Alan, Bart or Sabre will come up with something much more elegant! or one might almost say, exquisite.

    @Norm: The second case in your last posting also removes the </P>, something you haven't included in the list of tags to remove.
    Java Code:
    public class MatchNotSpaceNotP {
    
       public static void main(String[] args) {
          String input = "< P>This is<B>a</B> sample." +
                  "<H1>Header</H1>" +
                  "<P>new para<UL><LI>first item</LI></P>";
          String desired = "< P>This isa sample." +
                  "<H1>Header</H1>" +
                  "<P>new para<LI>first item</LI></P>";
          String regex = "<[ ]{0,1}(B|/B|UL)>";
          String output = input.replaceAll(regex, "");
          System.out.println("1. " + output.equals(desired));
          
          desired = "< P>This isa sample." +
                  "<H1>Header" +
                  "<P>new para<LI>first item";
          regex = "<[ ]{0,1}(B|/B|/H1|/LI|UL|/P)>";
          output = input.replaceAll(regex, "");
          System.out.println("2. " + output.equals(desired));
       }
    }
    There's no magic here, the regex is almost self-explanatory. It's the elegant ones that can consume the last of one's hair :)

    If you want to allow for more than one space (zero to any number), just use [ ]* instead of [ ]{0,1}

    db

  8. #8
    Fubarable's Avatar
    Fubarable is offline Moderator
    Join Date
    Jun 2008
    Posts
    19,316
    Blog Entries
    1
    Rep Power
    26


    Sorry Darryl, I didn't mean to step on your toes. All I know is I'm no regex pro or even amateur just yet.

  9. #9
    DarrylBurke's Avatar
    DarrylBurke is offline Member
    Join Date
    Sep 2008
    Location
    Madgaon, Goa, India
    Posts
    11,188
    Rep Power
    19


    Me neither, but I enjoy trying ;)

  10. #10
    Norm's Avatar
    Norm is offline Moderator
    Join Date
    Jun 2008
    Location
    SW Missouri
    Posts
    17,305
    Rep Power
    25


    Thanks again.
    I was hoping for a regex that would leave the desired tags, rather than one that removes the undesired tags. That would make for a shorter, easier-to-maintain regex.

    I keep these as comments in a regex program to remember which shortcuts do what.
    // ? = {0,1}
    // * = {0,}
    // + = {1,}
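
    A small sketch to check those equivalences, assuming a literal space as the repeated character. The class name QuantifierDemo and the helper matches are invented for the example.

```java
class QuantifierDemo {

    // Hypothetical helper: true if the whole string matches the regex.
    static boolean matches(String regex, String s) {
        return s.matches(regex);
    }

    public static void main(String[] args) {
        // Zero, one and two spaces between '<' and 'P'.
        String[] samples = {"<P>", "< P>", "<  P>"};
        for (String s : samples) {
            System.out.println(s
                    + "  ?: " + matches("< ?P>", s)
                    + "  {0,1}: " + matches("< {0,1}P>", s)
                    + "  *: " + matches("< *P>", s)
                    + "  {0,}: " + matches("< {0,}P>", s)
                    + "  +: " + matches("< +P>", s)
                    + "  {1,}: " + matches("< {1,}P>", s));
        }
    }
}
```

    Each shorthand and its braces form should report the same result for every sample.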

  11. #11
    DarrylBurke's Avatar
    DarrylBurke is offline Member
    Join Date
    Sep 2008
    Location
    Madgaon, Goa, India
    Posts
    11,188
    Rep Power
    19


    For a regex that leaves desired tags, you have to start by framing your requirement that way :) The requirement you posted expresses which tags are to be removed.

    Do take Fubarable's advice on trying on the Sun forums, there are at least 3 real regex gurus there -- but don't expect to understand their suggestions at a single reading ;)

  12. #12
    Norm's Avatar
    Norm is offline Moderator
    Join Date
    Jun 2008
    Location
    SW Missouri
    Posts
    17,305
    Rep Power
    25


    Thanks. Will do.
    Sorry for the poorly phrased question. I've seen enough to know better.

  13. #13
    Nicholas Jordan's Avatar
    Nicholas Jordan is offline Senior Member
    Join Date
    Jun 2008
    Location
    Southwest
    Posts
    1,018
    Rep Power
    8


    Well, zero or one is "?", is it not? Thus my original post is not that far from where we are now.

    We are only coffee drinkers, you are a beer maker, so maybe... are you just trying to remove unwanted spaces, or to remove entire tags? NFAs are remarkably limited in the logic department, so deciding whether to keep a tag based on some logic may need to be done as if(regex.find()){next();} or something.

    Thus, regex = "<[ ]{0,1}(B|/B|UL)>"; may find things in an unexpected manner later.

    { Fubarable: step on my toes if it bothers the others, no longer matters to me }

  14. #14
    Norm's Avatar
    Norm is offline Moderator
    Join Date
    Jun 2008
    Location
    SW Missouri
    Posts
    17,305
    Rep Power
    25


    I misstated the original problem. I'll try to restate it. I'm looking for a regex that will remove ALL HTML tags except for a few that I'd like to put in a list such as: (P|H1|LI|<rest of list>). The regex would remove the < -tag stuff- > for those tags NOT in the list. Tags in the list can include blanks. For example: < P> or <LI >. The -tag stuff- can include any legal HTML including blanks. What I came up with that worked for listed tags that did NOT contain spaces: <[^(P|H1|LI)].*?>. This one fails to recognize and not remove: < P> (with a space) but does work with <P>.
    I tried adding \\s* to skip optional leading spaces but this didn't work.
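
    One possible single-pass approach, sketched under assumptions rather than offered as a definitive answer: a negative lookahead placed right after the < can check the whole tag name, optional spaces included, before anything is consumed. The sketch assumes upper-case tag names, assumes that closing tags of listed names are kept as well, and assumes no > appears inside an attribute value. KeepListedTags, KEEP and strip are names invented here.

```java
class KeepListedTags {

    // Assumption: listed names are upper case and contain no regex metacharacters.
    static final String KEEP = "P|H1|LI";

    // Matches any <...> whose name (after optional spaces and an optional '/')
    // is NOT one of the names in KEEP. The \b stops "P" from also protecting "PRE".
    static final String REGEX = "<(?!\\s*/?\\s*(?:" + KEEP + ")\\b)[^>]*>";

    static String strip(String html) {
        return html.replaceAll(REGEX, "");
    }

    public static void main(String[] args) {
        String input = "< P>This is<B>a</B> sample.<H1>Header</H1>"
                + "<P>new para<UL><LI>first item</LI></P>";
        // prints "< P>This isa sample.<H1>Header</H1><P>new para<LI>first item</LI></P>"
        System.out.println(strip(input));
    }
}
```

    Adding a name to KEEP is the only change needed to retain another tag, which is the easier-to-maintain shape asked for earlier in the thread.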

  15. #15
    Nicholas Jordan's Avatar
    Nicholas Jordan is offline Senior Member
    Join Date
    Jun 2008
    Location
    Southwest
    Posts
    1,018
    Rep Power
    8


    < ?(P|H|LI)>[^<]*< ?/?(P|H|LI)>

    I'm practicing on your dime so if a master improves on this both of us gain, plus I can use some of this in what I am working on this morning.

  16. #16
    Norm's Avatar
    Norm is offline Moderator
    Join Date
    Jun 2008
    Location
    SW Missouri
    Posts
    17,305
    Rep Power
    25


    String data = "<P>This is<B>a</B> sample.<H1>Header</H1><P>new para<UL><LI>first item</LI></P>";
    String regex = "< ?(P|H|LI)>[^<]*< ?/?(P|H|LI)>";
    System.out.println(regex + " gives " + data.replaceAll(regex, ""));
    // The output:
    < ?(P|H|LI)>[^<]*< ?/?(P|H|LI)> gives <P>This is<B>a</B> sample.<H1>Header</H1><P>new para<UL></P>

    Nope, that doesn't remove the <B> and <UL>.

  17. #17
    DarrylBurke's Avatar
    DarrylBurke is offline Member
    Join Date
    Sep 2008
    Location
    Madgaon, Goa, India
    Posts
    11,188
    Rep Power
    19


    (Code earlier posted in Norm's thread on the Sun forum)
    Java Code:
    public class RetainTags {
     
       public static void main(String[] args) {
          String input = "< P>This is<B>a</B> sample." +
                  "<H1>Header</H1>" +
                  "<P>new para<UL><LI>first item</LI></P>";
          String[] tagsToRetain = {"B", "/B", "UL"};
          
          char leftDelimiter = '\u00AB';
          while (input.contains("" + leftDelimiter)) {
             leftDelimiter++;
          }
          char rightDelimiter = (char) (leftDelimiter + 1);
          while (input.contains("" + rightDelimiter)) {
             rightDelimiter++;
          }
          
          String regex;
          String output = input;
          for (String tag : tagsToRetain) {
             regex = "(<\\s?" + tag + ">)";
             output = output.replaceAll(regex, leftDelimiter + "$1" + rightDelimiter);
             System.out.println(output);
          }
     
          regex = "(?<!" + leftDelimiter + ")<[^>]+>(?!" + rightDelimiter+ ")";
          output = output.replaceAll(regex, "");
          System.out.println(output);
          
          regex = "[" + leftDelimiter + rightDelimiter + "]";
          output = output.replaceAll(regex, "");
          System.out.println(output);
       }
    }
    db

    Edit: The code depends on finding two characters not contained in the input String. I chose the left chevron (\u00AB) purely arbitrarily as a starting point for the search.
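
    The lookbehind/lookahead step in the code above can be sketched in isolation. This is only an illustration: it uses literal [ and ] as stand-ins for the two delimiter characters, and the names LookaroundDemo and stripUnmarked are made up.

```java
class LookaroundDemo {

    // (?<!X) succeeds only if the text just before this point is not X.
    // (?!X)  succeeds only if the text just after this point is not X.
    // Neither construct consumes any characters.
    static String stripUnmarked(String s) {
        return s.replaceAll("(?<!\\[)<[^>]+>(?!\\])", "");
    }

    public static void main(String[] args) {
        // Tags wrapped in [ ] are protected; the others are removed.
        // prints "[<B>]bold[</B>] and italic"
        System.out.println(stripUnmarked("[<B>]bold[</B>] and <I>italic</I>"));
    }
}
```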
    Last edited by DarrylBurke; 10-01-2008 at 10:07 PM.

  18. #18
    Norm's Avatar
    Norm is offline Moderator
    Join Date
    Jun 2008
    Location
    SW Missouri
    Posts
    17,305
    Rep Power
    25


    Darryl.
    Thanks for finding a solution. Not very elegant having to do three passes, but it works.
    Could you tell me where I can find the docs you are using for regex, especially for the ! character?
    The API doc says "negative" lookahead and lookbehind. I have the Programming Perl book as a reference, and it says nothing about that.
    Thanks.

  19. #19
    DarrylBurke's Avatar
    DarrylBurke is offline Member
    Join Date
    Sep 2008
    Location
    Madgaon, Goa, India
    Posts
    11,188
    Rep Power
    19


    Um, it's not necessarily 3 passes; the solution scales to as many tags as you want to retain. It's not reasonably likely to matter here, but strictly speaking the concatenated variable should be quoted (either with the \Q and \E metacharacters or with Pattern.quote()) so that a metacharacter in it doesn't happen to blow up the regex:
    Java Code:
    regex = "(<\\s?\\Q" + tag + "\\E>)";
    or
    Java Code:
    regex = "(<\\s?" + Pattern.quote(tag) + ">)";
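
    A quick sketch of what the quoting buys, using a hypothetical tag name ("?xml") that contains a metacharacter. Unquoted, the ? is read as a quantifier modifier and the pattern silently matches the wrong tag. QuoteDemo and tagRegex are names invented for the example.

```java
import java.util.regex.Pattern;

class QuoteDemo {

    // Builds the tag pattern shown above, with or without quoting the tag.
    static String tagRegex(String tag, boolean quoted) {
        return "(<\\s?" + (quoted ? Pattern.quote(tag) : tag) + ">)";
    }

    public static void main(String[] args) {
        String tag = "?xml"; // hypothetical tag name containing a metacharacter
        String input = "<?xml><xml>";
        // Unquoted: "\s??" is parsed as a reluctant optional space, so <xml> matches.
        System.out.println(input.replaceAll(tagRegex(tag, false), "[$1]")); // prints "<?xml>[<xml>]"
        // Quoted: the '?' is literal, so only <?xml> matches.
        System.out.println(input.replaceAll(tagRegex(tag, true), "[$1]"));  // prints "[<?xml>]<xml>"
    }
}
```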
    The starting reference work is of course the Pattern API, and I mostly refer to this tutorial when I'm stuck:
    Regular Expression Tutorial - Learn How to Use Regular Expressions

    When I'm really stuck I read through anything Google can find for me.

    And when I'm really really stuck I wait for some guru to post a solution and then try to understand it, with help from the author ;)

    db

  20. #20
    Norm's Avatar
    Norm is offline Moderator
    Join Date
    Jun 2008
    Location
    SW Missouri
    Posts
    17,305
    Rep Power
    25
