- 09-27-2008, 05:09 PM #1
regex problem - allowing optional space
Hi,
I'm trying to get an expression to match a string with an optional space in the second position. I've tried the following patterns with no luck. The pattern should not match "<bP>", where b is an optional space. I've tried \\s for whitespace and [ ] with a space between the brackets, with no luck.
The first string has a space between the < and the P.
System.out.println("< P> <B>and <P>".replaceAll("< *?[^P].*?>", "")); //  and <P>
System.out.println("<P> <B>and <P>".replaceAll("< *?[^P].*?>", "")); // <P> and <P>
Desired result for first string: < P> and <P>
Thanks,
Norm
Last edited by Norm; 09-27-2008 at 08:01 PM. Reason: Corrected example by including <B>
- 09-27-2008, 07:10 PM #2
replaceAll("< ?
deceptive, simple, works
unlike the hour download for some other stuff I have running right now.
which is simple, deceptive, doesn't work....
Introduction to Programming Using Java.
Cybercartography: A new theoretical construct proposed by D.R. Fraser Taylor
- 09-27-2008, 07:15 PM #3
Are you suggesting a solution for my problem?
Can you post a working example?
- 09-28-2008, 11:03 AM #4
Maybe I haven't understood the requirement perfectly, but for the two inputs and expected outputs you have given, this is adequate:
Java Code:
public class MatchNotSpaceNotP {
    public static void main(String[] args) {
        String[] inputs = new String[2];
        inputs[0] = "< P> <B>and <P>";
        inputs[1] = "<P> <B>and <P>";
        // Remove any <...> run that contains no 'P' between the brackets
        String regex = "<[^P]+>";
        for (String string : inputs) {
            System.out.println(string.replaceAll(regex, ""));
        }
    }
}
- 09-28-2008, 04:19 PM #5
Darryl,
Thanks for that.
I only gave you part of the specs which include more than skipping over optional spaces after the <.
So let me try again. Here are three strings. One input and two desired outputs, each requiring a different pattern.
Regex are sort of magic for me. Your pattern doesn't seem to account for optional spaces. I had tried: \\s{0,}. No luck.
String input = "< P>This is<B>a</B> sample.<H1>Header</H1><P>new para<UL><LI>first item</LI></P>";
// Remove <B>, </B> and <UL>
String Desired1 = "< P>This isa sample.<H1>Header</H1><P>new para<LI>first item</LI></P>";
// Remove <B>, </B>, </H1>, </LI> and <UL>
String Desired2 = "< P>This isa sample.<H1>Header<P>new para<LI>first item";
I also tried: "<[^(P|H1|LI)].*?>" But this one failed with the space between the < and the P.
-
If you don't get a decent answer here, I suggest you ask on the forums.sun.com site. They have some wonderful regex mavens there, such as uncleAlice. Of course, if you do this, link both cross-posted threads to one another...
Good luck.
- 09-28-2008, 07:53 PM #7
Hey Pete gimme a chance ;) Although I do agree that Alan, Bart or Sabre will come up with something much more elegant! or one might almost say, exquisite.
@Norm: The second case in your last posting also removes the </P>, something you haven't included in the list of tags to remove.
There's no magic here, the regex is almost self-explanatory. It's the elegant ones that can consume the last of one's hair :)
Java Code:
public class MatchNotSpaceNotP {
    public static void main(String[] args) {
        String input = "< P>This is<B>a</B> sample."
                + "<H1>Header</H1>"
                + "<P>new para<UL><LI>first item</LI></P>";

        // Case 1: remove <B>, </B> and <UL>
        String desired = "< P>This isa sample."
                + "<H1>Header</H1>"
                + "<P>new para<LI>first item</LI></P>";
        String regex = "<[ ]{0,1}(B|/B|UL)>";
        String output = input.replaceAll(regex, "");
        System.out.println("1. " + output.equals(desired));

        // Case 2: also remove </H1>, </LI> and </P>
        desired = "< P>This isa sample."
                + "<H1>Header"
                + "<P>new para<LI>first item";
        regex = "<[ ]{0,1}(B|/B|/H1|/LI|UL|/P)>";
        output = input.replaceAll(regex, "");
        System.out.println("2. " + output.equals(desired));
    }
}
If you want to allow for more than one space (zero to any number), just use [ ]* instead of [ ]{0,1}
db
-
Sorry Darryl, I didn't mean to step on your toes. All I know is I'm no regex pro or even amateur just yet.
- 09-28-2008, 08:23 PM #9
Me neither, but I enjoy trying ;)
- 09-29-2008, 02:23 PM #10
Thanks again.
I was hoping for a regex that would leave the desired tags rather than one that removes the undesired tags. That would make for a shorter, easier-to-maintain regex.
I keep these as comments in a regex program to remember which shortcuts do what.
// ? = {0,1}
// * = {0,}
// + = {1,}
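The shorthand in the comments above can be checked directly. Here is a minimal sketch (the class name `QuantifierShorthand` and its helper `matches` are made up for illustration) that compares each shortcut with its brace form on tags with zero, one, or two spaces after the `<`:

```java
import java.util.regex.Pattern;

class QuantifierShorthand {
    // Hypothetical helper: true when the regex matches the entire input.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String[] samples = {"<P>", "< P>", "<  P>"};
        // Each pair holds a shortcut and the brace form it should be equivalent to.
        String[][] pairs = {
            {"< ?P>", "< {0,1}P>"},
            {"< *P>", "< {0,}P>"},
            {"< +P>", "< {1,}P>"}
        };
        for (String[] pair : pairs) {
            for (String s : samples) {
                System.out.println(pair[0] + " / " + pair[1] + " on \"" + s + "\": "
                        + matches(pair[0], s) + " / " + matches(pair[1], s));
            }
        }
    }
}
```

Both columns of each printed line should agree, since each pair is two spellings of the same quantifier.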
- 09-29-2008, 02:45 PM #11
For a regex that leaves desired tags, you have to start by framing your requirement that way :) The requirement you posted expresses which tags are to be removed.
Do take Fubarable's advice on trying on the Sun forums, there are at least 3 real regex gurus there -- but don't expect to understand their suggestions at a single reading ;)
- 09-29-2008, 03:23 PM #12
Thanks. Will do.
Sorry for the poorly phrased question. I've seen enough to know better.
- 09-30-2008, 02:07 PM #13
Well, one or zero is "?", is it not? Thus my original work is not that far away from where we are now.
We are only coffee drinkers, you are a beer maker, so maybe: are you just trying to remove unwanted spaces, or remove entire tags? NFAs are remarkably limited in the logic department, so deciding whether to keep a tag based on some logic may need to be done as if(regex.find()){next();} or something.
Thus, regex = "<[ ]{0,1}(B|/B|UL)>"; may find things in an unexpected manner later.
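The find()-and-decide idea mentioned above could look roughly like the following. This is only a sketch under assumptions: the class `TagFilterLoop`, the method `keepOnly`, and the deliberately simple tag pattern are all invented for illustration, and the pattern does not cope with a `>` inside an attribute value.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagFilterLoop {
    // Matches an opening or closing tag and captures the tag name in group 1.
    // Optional spaces are allowed after '<' and after '/'.
    private static final Pattern TAG =
            Pattern.compile("<\\s*/?\\s*([A-Za-z][A-Za-z0-9]*)[^>]*>");

    // Copies the input, dropping every tag whose name is not in 'keep'.
    static String keepOnly(String input, Set<String> keep) {
        Matcher m = TAG.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(input, last, m.start());          // text before the tag
            if (keep.contains(m.group(1).toUpperCase(Locale.ROOT))) {
                out.append(m.group());                   // listed tag: keep as written
            }
            last = m.end();
        }
        out.append(input, last, input.length());        // text after the last tag
        return out.toString();
    }

    public static void main(String[] args) {
        Set<String> keep = new HashSet<String>(Arrays.asList("P", "H1", "LI"));
        String input = "< P>This is<B>a</B> sample.<H1>Header</H1>"
                + "<P>new para<UL><LI>first item</LI></P>";
        System.out.println(keepOnly(input, keep));
    }
}
```

The regex only locates tags; the keep-or-drop decision lives in ordinary Java, so the list of retained tags can change without touching the pattern.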
{ Fubarable: step on my toes if it bothers the others, no longer matters to me }
- 09-30-2008, 06:22 PM #14
I misstated the original problem. I'll try to restate it. I'm looking for a regex that will remove ALL HTML tags except for a few that I'd like to put in a list such as: (P|H1|LI|<rest of list>). The regex would remove the < -tag stuff- > for those tags NOT in the list. Tags in the list can include blanks. For example: < P> or <LI >. The -tag stuff- can include any legal HTML including blanks. What I came up with that worked for listed tags that did NOT contain spaces: <[^(P|H1|LI)].*?>. This one fails to recognize and not remove: < P> (with a space) but does work with <P>.
I tried adding \\s* to skip optional leading spaces but this didn't work.
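A likely reason for both failures: `[^(P|H1|LI)]` is a character class, not an alternation, so it matches any single character other than `(`, `P`, `|`, `H`, `1`, `L`, `I` or `)`. When `\\s*` is placed in front of it, the engine can backtrack `\\s*` to zero spaces and let the class consume the space itself. A small sketch to observe this (the class `CharClassPitfall` and the helper `strip` are made up for the demonstration):

```java
class CharClassPitfall {
    // Hypothetical helper: removes every match of the regex from the input.
    static String strip(String regex, String input) {
        return input.replaceAll(regex, "");
    }

    public static void main(String[] args) {
        String regex = "<\\s*[^(P|H1|LI)].*?>";
        // 'P' is excluded by the class, so <P> is left alone.
        System.out.println(strip(regex, "<P>text"));   // <P>text
        // \s* gives the space back, the class matches the space, and < P> is removed.
        System.out.println(strip(regex, "< P>text"));  // text
        // HR is not on the list, but 'H' is in the class, so <HR> is left alone too.
        System.out.println(strip(regex, "<HR>text"));  // <HR>text
    }
}
```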
- 09-30-2008, 06:58 PM #15
< ?(P|H|LI)>[^<]*< ?/?(P|H|LI)>
I'm practicing on your dime so if a master improves on this both of us gain, plus I can use some of this in what I am working on this morning.
- 09-30-2008, 09:45 PM #16
String data = "<P>This is<B>a</B> sample.<H1>Header</H1><P>new para<UL><LI>first item</LI></P>";
String regex = "< ?(P|H|LI)>[^<]*< ?/?(P|H|LI)>";
System.out.println(regex + " gives " + data.replaceAll(regex, ""));
// The output:
< ?(P|H|LI)>[^<]*< ?/?(P|H|LI)> gives <P>This is<B>a</B> sample.<H1>Header</H1><P>new para<UL></P>
Nope that doesn't remove the <B> and <UL>
- 10-01-2008, 10:04 PM #17
(Code earlier posted in Norm's thread on the Sun forum)
Java Code:
public class RetainTags {
    public static void main(String[] args) {
        String input = "< P>This is<B>a</B> sample."
                + "<H1>Header</H1>"
                + "<P>new para<UL><LI>first item</LI></P>";
        String[] tagsToRetain = {"B", "/B", "UL"};

        // Find two marker characters that do not occur in the input
        char leftDelimiter = '\u00AB';
        while (input.contains("" + leftDelimiter)) {
            leftDelimiter++;
        }
        char rightDelimiter = (char) (leftDelimiter + 1);
        while (input.contains("" + rightDelimiter)) {
            rightDelimiter++;
        }

        // Wrap each tag to retain in the two markers
        String regex;
        String output = input;
        for (String tag : tagsToRetain) {
            regex = "(<\\s?" + tag + ">)";
            output = output.replaceAll(regex, leftDelimiter + "$1" + rightDelimiter);
            System.out.println(output);
        }

        // Remove every tag that is not wrapped in markers
        regex = "(?<!" + leftDelimiter + ")<[^>]+>(?!" + rightDelimiter + ")";
        output = output.replaceAll(regex, "");
        System.out.println(output);

        // Remove the markers themselves
        regex = "[" + leftDelimiter + rightDelimiter + "]";
        output = output.replaceAll(regex, "");
        System.out.println(output);
    }
}
db
edit The code depends on finding two characters not contained in the input String. I chose the left chevron purely arbitrarily as a starting point for the search.
Last edited by DarrylBurke; 10-01-2008 at 10:07 PM.
- 10-02-2008, 02:59 AM #18
Darryl.
Thanks for finding a solution. Not very elegant having to do three passes, but it works.
Could you tell me where I can find the documentation you are using for regex, especially for the ! character?
The API doc says "negative" lookahead and lookbehind. I have the Programming Perl book as a reference and it says nothing about that.
Thanks.
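For readers with the same question about `!`: in java.util.regex, `(?!X)` is a zero-width negative lookahead (succeeds only where X does not match next) and `(?<!X)` is a zero-width negative lookbehind (succeeds only where X does not match just before). The sketch below shows both on toy strings, then tries a lookahead on the tag problem from this thread. The class `LookaroundDemo`, the `KEEP` list and the single-pass pattern are assumptions made for illustration, not code from any post here, and the pattern is not a general HTML parser.

```java
class LookaroundDemo {
    // Tag names to keep; an assumption taken from the examples in this thread.
    static final String KEEP = "P|H1|LI";

    // "<" NOT followed by (optional spaces, optional "/", a listed name, a word
    // boundary), then the rest of the tag up to ">".
    static final String REMOVE_OTHERS = "<(?!\\s*/?\\s*(?:" + KEEP + ")\\b)[^<>]*>";

    static String stripUnlisted(String input) {
        return input.replaceAll(REMOVE_OTHERS, "");
    }

    public static void main(String[] args) {
        // Negative lookahead: an 'a' that is not followed by 'b'.
        System.out.println("ab ac".replaceAll("a(?!b)", "X"));   // ab Xc
        // Negative lookbehind: a 'b' that is not preceded by 'a'.
        System.out.println("ab cb".replaceAll("(?<!a)b", "X"));  // ab cX

        String input = "< P>This is<B>a</B> sample.<H1>Header</H1>"
                + "<P>new para<UL><LI>first item</LI></P>";
        System.out.println(stripUnlisted(input));
    }
}
```

The `\\b` after the name list is there so that a listed name such as P does not also protect a longer tag such as PRE.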
- 10-02-2008, 05:58 AM #19
Um, it's not necessarily 3 passes, the solution is scalable to as many tags as you want to retain. Not reasonably likely to happen here, but strictly speaking the concatenated variable should be quoted (either with the \Q and \E metacharacters or with Pattern.quote()) so that a metacharacter doesn't happen to blow up the regex:
Java Code:
regex = "(<\\s?\\Q" + tag + "\\E>)";
or
Java Code:
regex = "(<\\s?" + Pattern.quote(tag) + ">)";
The starting reference work is of course the Pattern API, and I mostly refer to this tutorial when I'm stuck:
Regular Expression Tutorial - Learn How to Use Regular Expressions
When I'm really stuck I read through anything Google can find for me.
And when I'm really really stuck I wait for some guru to post a solution and then try to understand it, with help from the author ;)
db
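To see what the quoting advice in this post protects against, here is a small sketch. The class `QuoteTagDemo`, its two helpers, and the tag string "?xml" are hypothetical, chosen only because the string contains a regex metacharacter.

```java
import java.util.regex.Pattern;

class QuoteTagDemo {
    // Builds the pattern by plain concatenation.
    static String unquoted(String tag) {
        return "(<\\s?" + tag + ">)";
    }

    // Builds the same pattern with the tag quoted as a literal.
    static String quoted(String tag) {
        return "(<\\s?" + Pattern.quote(tag) + ">)";
    }

    public static void main(String[] args) {
        String tag = "?xml"; // hypothetical tag text containing a metacharacter

        // Unquoted, the '?' attaches to \s? and makes it a lazy quantifier,
        // so the pattern no longer contains a literal '?'.
        System.out.println("<?xml>".matches(unquoted(tag))); // false
        System.out.println("<xml>".matches(unquoted(tag)));  // true

        // Quoted, the tag text is matched character for character.
        System.out.println("<?xml>".matches(quoted(tag)));   // true
        System.out.println("<xml>".matches(quoted(tag)));    // false
    }
}
```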
- 10-02-2008, 02:02 PM #20