Results 1 to 12 of 12
- 06-10-2011, 04:32 PM #1
Member
- Join Date
- Jun 2007
- Posts
- 19
- Rep Power
- 0
Regular Expresion: match longest leftmost alternative, not first?
Java regular expressions always seem to match the first of several alternatives, no matter which of the alternatives would produce the longest match.
With the POSIX regexp standard, the leftmost alternative which produces the longest match is chosen. Is it somehow possible to achieve that in Java, if it must be, through a thrid party library?
- 06-10-2011, 07:31 PM #2
Use an appropriate quantifier (see the Pattern javadoc). No need for a third party library.
db
- 06-11-2011, 10:02 AM #3
Member
- Join Date
- Jun 2007
- Posts
- 19
- Rep Power
- 0
I do not see the connection with quantifiers? My question is about matching the longest of several alternatives instead of the leftmost alternative.
- 06-11-2011, 01:13 PM #4
- Join Date
- Sep 2008
- Location
- Voorschoten, the Netherlands
- Posts
- 11,604
- Blog Entries
- 7
- Rep Power
- 17
- 06-11-2011, 03:40 PM #5
Member
- Join Date
- Jun 2007
- Posts
- 19
- Rep Power
- 0
Thank you, but I think this is a different topic and not related to my post.
What I am interested in is the following: we have a pattern iof the following form
(<alternativeA>|<alternativeB>)
where both alternatives contain some quantifiers. It is not relevant if these quantifiers are greedy, lazy or possessive.
Now, for a number of strings, the patterns for both alternative A and B can match and sometimes pattern A produces the longer match and sometimes the pattern B would produce the longer match.
I would like to always get the longer match (in my real example I have more alternatives).
However, the Java regexp engine always picks the first alternative from the left that matches, so even if alternative B would produce a longer match, it is not even tried.
This is different with regular expression engines that follow the POSIX standard, which always match against that alternatives which produces the longest match (if there are several with the same match length, the leftmost one is picked).
As far as reading the docs has shown me that behavior does not seem to be possible with the Java built in matcher and there is really no workaround to produce that behavior either.
So my question is: am i correct that it is indeed impossible POSIX-conformant matching of leftmost longest alternatives with standard Java and if so, is there some library that can do it?
- 06-11-2011, 03:58 PM #6
- Join Date
- Sep 2008
- Location
- Voorschoten, the Netherlands
- Posts
- 11,604
- Blog Entries
- 7
- Rep Power
- 17
Would it help if you tried both forms and see which one matched the longer (sub)string? e.g.
kind regards,Java Code:import java.util.regex.Matcher; import java.util.regex.Pattern; public class T { public static void main(String args[]) throws Exception { String s= "aaaaaba"; Pattern pa= Pattern.compile("a*b|aa*ba"); Pattern pb= Pattern.compile("aa*ba|a*b"); Matcher ma= pa.matcher(s); while (ma.find()) { System.out.println(ma.group()); } Matcher mb= pb.matcher(s); while (mb.find()) { System.out.println(mb.group()); } } }
JosWhen people rob a bank they get a penalty; when banks rob people they get a bonus.
- 06-11-2011, 05:50 PM #7
Member
- Join Date
- Jun 2007
- Posts
- 19
- Rep Power
- 0
I have thought of this approach, but it has two disadvantages: as I said, I do not just have two alternatives but several. So the performance will probably be much worse than doing this with a single finite automata compiled from the regular expression. Second, doing it for each alternative seperately will find overlapping matches which need to get sorted out, costing even more performance.
This all could get avoided by some library that simply implements POSIX-style alternatives matching - I guess I am out of luck here though.
- 06-11-2011, 06:02 PM #8
- Join Date
- Sep 2008
- Location
- Voorschoten, the Netherlands
- Posts
- 11,604
- Blog Entries
- 7
- Rep Power
- 17
Can you give an example showing what you mean because in my simplistic mind for alternatives A0, A1, ... An there is no need to try each permutation of alternatives, i.e. A0|A1|A2, A1|A2|A1 etc. you just try each alternative and pick the longest matching alternative ...
kind regards,
JosWhen people rob a bank they get a penalty; when banks rob people they get a bonus.
- 06-11-2011, 06:32 PM #9
Member
- Join Date
- Jun 2007
- Posts
- 19
- Rep Power
- 0
Well if there is no option to make POSIX-style alternatives matching work with Java regexps and there is no library either, I think i know how to simulate it (pseudocode):
This would reduce the number of times we have to reset the matchers and still avoid problems of the matchers getting out of sync with other matcher's longest matches (and thus matching invalid substrings).Java Code:curoffset = 0 while(curoffset < stringlength) { for each matcher where matcher.start < curoffset: reset the matcher to find from curoffset find the smallest start of all matchers, and for all matchers with the smallest start, find the longest matches choose that longest match that has the lowest matcher number consume that match and set curroffset to that matcher's end() }
However the overall effort is of corse still higher than with a single DFA (or even an NFA simulating text-directed DFA matching)
If anyone *does* know a library that can do POSIX-style alternatives matching or some other apprach, I would still be happy to hear about it.Last edited by johann_p; 06-11-2011 at 06:34 PM.
- 06-11-2011, 08:41 PM #10
- Join Date
- Sep 2008
- Location
- Voorschoten, the Netherlands
- Posts
- 11,604
- Blog Entries
- 7
- Rep Power
- 17
Maybe Apache's regexp package can do what you want. Note: I didn't try it and maybe the built-in regexp pacakge can do what you want, I'm just not one of those regexp gurus (there are some of those in Oracle's Java forum).
kind regards,
JosWhen people rob a bank they get a penalty; when banks rob people they get a bonus.
- 06-11-2011, 10:41 PM #11
Last edited by DarrylBurke; 06-12-2011 at 05:47 AM.
- 06-12-2011, 08:38 AM #12
- Join Date
- Sep 2008
- Location
- Voorschoten, the Netherlands
- Posts
- 11,604
- Blog Entries
- 7
- Rep Power
- 17
Similar Threads
-
Help with TreeSet and finding longest string
By JavaTheHut in forum New To JavaReplies: 3Last Post: 05-04-2011, 10:23 PM -
Longest common sequence
By fam2315 in forum New To JavaReplies: 8Last Post: 04-29-2011, 02:00 AM -
Longest word in a program...
By hustlas4ever in forum New To JavaReplies: 5Last Post: 08-20-2010, 01:34 PM -
SQLException:java.sql.Exception:Data type mismatch in criteria expresion
By Dumisan in forum JDBCReplies: 6Last Post: 02-21-2010, 12:54 AM -
Get all groups from a regular expression match?
By johann_p in forum New To JavaReplies: 0Last Post: 05-16-2008, 07:50 PM


LinkBack URL
About LinkBacks
Reply With Quote

Bookmarks