- 10-02-2008, 03:45 PM #21
Mastering Regular Expressions
Mastering Regular Expressions appears to be profoundly useful in this setting. I would have worked on this using that book, but I am swamped right now.
It is hard to envision someone of this caliber being "really stuck" .....
- 10-03-2008, 07:04 AM #22
Actually, regexes are not always a good solution to problems like this.
It looks like you are trying to pull apart real HTML, which is rarely compliant with even the weak HTML specs.
There are times when you really want a proper parser. In the olden days, you'd use lex and yacc, but there are now Java equivalents and even compiler development books using them.
With a context-sensitive grammar, you can easily decide which whitespace is meaningful and which is just optional filler. Doing it in a context-free grammar may be possible, but I have not beaten my brain into those depths in a decade.
- 10-03-2008, 02:11 PM #23
This was an exercise to learn regex.
I'll restate the problem.
Looking for a regex that will match a substring:
starting with a <
followed by optional spaces
followed by any char that is NOT a P
followed by any characters
followed by a >
I can't get the optional spaces bit to work.
I thought \s*? would do it: whitespace, 0 or more, non-greedy.
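A minimal sketch of why the optional-spaces part misbehaves, and one possible way around it. The full pattern from the post is not shown, so RELUCTANT below is only a guess at it; the class and method names are made up for illustration. The point is that \s*? (and plain \s* too, after backtracking) is allowed to match zero spaces, which lets [^P] consume the space itself. A possessive \s*+ never gives the spaces back, so the character after them really has to be a non-P.

```java
import java.util.regex.Pattern;

public class OptionalSpaces {
    // Assumed reconstruction of the attempt: \s*? may match zero spaces,
    // so [^P] is free to match the space in "< P>".
    static final Pattern RELUCTANT = Pattern.compile("<\\s*?[^P].*?>");

    // Possible fix: possessive \s*+ consumes all spaces and never backtracks,
    // so [^P>] is tested against the first non-space character.
    static final Pattern POSSESSIVE = Pattern.compile("<\\s*+[^P>][^>]*>");

    // True if the pattern matches anywhere in the input.
    static boolean found(Pattern p, String input) {
        return p.matcher(input).find();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"<B>", "<  B>", "<P>", "< P>"}) {
            System.out.println("'" + s + "' reluctant=" + found(RELUCTANT, s)
                    + " possessive=" + found(POSSESSIVE, s));
        }
    }
}
```

Another option, under the same assumptions, is to keep the ordinary \s* and exclude whitespace from the class instead, as in <\s*[^P\s>][^>]*>. Either way a closing tag such as </P> still matches, because the first character after the < is a slash, not a P.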
- 10-05-2008, 06:19 PM #24
Here's as far as I'm going with this problem. The given regex removes all tags except those in the list. It allows for leading blanks and lowercase.
String regex = "<[^((\\s*P)|(\\s*H1)|(\\s*LI))].*?>"; // Remove all tags except these
System.out.println(regex + " on '" + input + "' = '" + Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(input).replaceAll("") + "'");
- 10-05-2008, 07:51 PM #25
Java Code:
String regex = "<[^((\\s*P)|(\\s*H1)|(\\s*LI))].*?>";
Probably an interesting study, but where is the forward slash in the closing tag? And this will leave the text between the opening and closing tags, will it not? That appears to me to be the intent of the work: to save the displayed text, perhaps exposing it to further work later in the code, such as wrapping it in a new tag. I would think we could use parentheses in a tree-like manner, placing code to find the closing tag, given the limited logic of Pattern/Matcher combinations.
Mostly, not to be picky, what I see here is alternation inside [], which appears to me to need further review. Then later we have .* which I would think would slurp to the EOL. If I used that, I would place it at the back of a tree or something so as not to pick it up, or use syntax which states "this is here but do not use it", for efficiency reasons. The above code reads to me to skip any tag (the contents of any tag) which holds P or H or "1" or L or I, but it finds the opening brackets either way and does not account for the forward slash in the closing tag. Text between tags would be skipped, and thus the design says to me student's work that tries to find all opening tags but does not do it correctly.
Not harping, learning on your dime.
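A small sketch to make the points about the quoted regex concrete. Inside [^...], the characters ( ) | * are literals, so the bracket expression is just a set of single characters that may not follow the <, not a list of tag names. Note also that the regex uses .*? (reluctant), which stops at the first > rather than running to the end of the line. The LOOKAHEAD pattern is only one possible alternative, assuming the goal is to keep P, H1 and LI tags (opening and closing) and strip everything else; the class name and sample input are invented for the demonstration.

```java
import java.util.regex.Pattern;

public class CharClassVsLookahead {
    // The regex from post #24, unchanged.
    static final Pattern ORIGINAL = Pattern.compile(
            "<[^((\\s*P)|(\\s*H1)|(\\s*LI))].*?>", Pattern.CASE_INSENSITIVE);

    // Hypothetical alternative: a negative lookahead rejects the listed tag
    // names, with an optional slash so closing tags are kept as well.
    static final Pattern LOOKAHEAD = Pattern.compile(
            "<(?!\\s*/?\\s*(?:P|H1|LI)\\b)[^>]*>", Pattern.CASE_INSENSITIVE);

    // Remove every match of the pattern from the input.
    static String strip(Pattern p, String input) {
        return p.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String input = "<HTML><P>one</P><I>two</I><B>three</B></HTML>";
        // <HTML> and <I> survive (H and I are in the excluded set), </P> is removed.
        System.out.println(strip(ORIGINAL, input));  // <HTML><P>one<I>twothree
        // Only the P tags survive; the text between tags is left in place.
        System.out.println(strip(LOOKAHEAD, input)); // <P>one</P>twothree
    }
}
```

In both cases the text between tags stays, since replaceAll only removes what the pattern matched. As noted earlier in the thread, real-world HTML may still defeat either pattern.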
- 10-05-2008, 08:49 PM #26
Good points. I'll leave them for those interested to pursue.
- 10-06-2008, 04:38 PM #27
My earlier post was based on a single read of the regex: I decided to just try to read it and convert it to plain English. I have a rather good book on regexes; the author, who has mastery of the subject, takes a decent part of a major chapter to show [] versus () as syntax. The short of it is: to keep it simple on smaller inputs, use ( | | | ), but for finds that can grow large on large strings, something along the lines of "^\\w+" or "^[abcdef]+" or "< ?[lsmft]>" is of greater efficiency. Alternation with | can and does drive NFA regex engines nuts under some conditions unless there are good tweaks and optimizations built in.
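For single characters, the two syntaxes accept the same strings, which is what makes the character class a drop-in replacement. A rough sketch, using the "< ?[lsmft]>" example from above next to an alternation written out by hand; the class name and the alternation form are assumptions for illustration, and nothing here measures speed.

```java
import java.util.regex.Pattern;

public class ClassVsAlternation {
    // Character class: one set lookup for the tag letter.
    static final Pattern WITH_CLASS = Pattern.compile("< ?[lsmft]>");

    // Same language spelled as alternation: each branch is tried in turn.
    static final Pattern WITH_ALTERNATION = Pattern.compile("< ?(?:l|s|m|f|t)>");

    // True if both patterns agree on whether the whole input matches.
    static boolean agree(String input) {
        return WITH_CLASS.matcher(input).matches()
                == WITH_ALTERNATION.matcher(input).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"<l>", "< t>", "<x>", "<lt>"}) {
            System.out.println("'" + s + "' class=" + WITH_CLASS.matcher(s).matches()
                    + " agree=" + agree(s));
        }
    }
}
```

The equivalence only holds while every branch is a single character; once a branch is a multi-character name such as H1 or LI, it cannot be folded into [] and needs () or a lookahead.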