Results 1 to 12 of 12
  1. #1
    johann_p is offline Member
    Join Date
    Jun 2007
    Posts
    19
    Rep Power
    0

    Question Regular Expresion: match longest leftmost alternative, not first?

    Java regular expressions always seem to match the first of several alternatives, no matter which of the alternatives would produce the longest match.

    With the POSIX regexp standard, the leftmost alternative which produces the longest match is chosen. Is it somehow possible to achieve that in Java, if it must be, through a thrid party library?

  2. #2
    DarrylBurke's Avatar
    DarrylBurke is offline Forum Police
    Join Date
    Sep 2008
    Location
    Madgaon, Goa, India
    Posts
    11,458
    Rep Power
    20

    Default

    Use an appropriate quantifier (see the Pattern javadoc). No need for a third party library.

    db

  3. #3
    johann_p is offline Member
    Join Date
    Jun 2007
    Posts
    19
    Rep Power
    0

    Default

    I do not see the connection with quantifiers? My question is about matching the longest of several alternatives instead of the leftmost alternative.

  4. #4
    JosAH's Avatar
    JosAH is offline Moderator
    Join Date
    Sep 2008
    Location
    Voorschoten, the Netherlands
    Posts
    13,765
    Blog Entries
    7
    Rep Power
    21

    Default

    Quote Originally Posted by johann_p View Post
    I do not see the connection with quantifiers? My question is about matching the longest of several alternatives instead of the leftmost alternative.
    This is straight from the API documentation for the Pattern class:

    Quote Originally Posted by Pattern API

    Greedy quantifiers

    X? X, once or not at all
    X* X, zero or more times
    X+ X, one or more times
    X{n} X, exactly n times
    X{n,} X, at least n times
    X{n,m} X, at least n but not more than m times

    Reluctant quantifiers

    X?? X, once or not at all
    X*? X, zero or more times
    X+? X, one or more times
    X{n}? X, exactly n times
    X{n,}? X, at least n times
    X{n,m}? X, at least n but not more than m times

    Possessive quantifiers

    X?+ X, once or not at all
    X*+ X, zero or more times
    X++ X, one or more times
    X{n}+ X, exactly n times
    X{n,}+ X, at least n times
    X{n,m}+ X, at least n but not more than m times
    kind regards,

    Jos
    cenosillicaphobia: the fear for an empty beer glass

  5. #5
    johann_p is offline Member
    Join Date
    Jun 2007
    Posts
    19
    Rep Power
    0

    Default

    Thank you, but I think this is a different topic and not related to my post.

    What I am interested in is the following: we have a pattern iof the following form
    (<alternativeA>|<alternativeB>)
    where both alternatives contain some quantifiers. It is not relevant if these quantifiers are greedy, lazy or possessive.
    Now, for a number of strings, the patterns for both alternative A and B can match and sometimes pattern A produces the longer match and sometimes the pattern B would produce the longer match.
    I would like to always get the longer match (in my real example I have more alternatives).
    However, the Java regexp engine always picks the first alternative from the left that matches, so even if alternative B would produce a longer match, it is not even tried.
    This is different with regular expression engines that follow the POSIX standard, which always match against that alternatives which produces the longest match (if there are several with the same match length, the leftmost one is picked).
    As far as reading the docs has shown me that behavior does not seem to be possible with the Java built in matcher and there is really no workaround to produce that behavior either.

    So my question is: am i correct that it is indeed impossible POSIX-conformant matching of leftmost longest alternatives with standard Java and if so, is there some library that can do it?

  6. #6
    JosAH's Avatar
    JosAH is offline Moderator
    Join Date
    Sep 2008
    Location
    Voorschoten, the Netherlands
    Posts
    13,765
    Blog Entries
    7
    Rep Power
    21

    Default

    Would it help if you tried both forms and see which one matched the longer (sub)string? e.g.

    Java Code:
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    
    public class T {
    	public static void main(String args[]) throws Exception {
    
    		String   s= "aaaaaba";
    		Pattern pa= Pattern.compile("a*b|aa*ba");
    		Pattern pb= Pattern.compile("aa*ba|a*b");
    		
    		Matcher ma= pa.matcher(s);
    
    		while (ma.find()) {
    			System.out.println(ma.group());
    		}
    
    		Matcher mb= pb.matcher(s);
    
    		while (mb.find()) {
    			System.out.println(mb.group());
    		}
    	}
    }
    kind regards,

    Jos
    cenosillicaphobia: the fear for an empty beer glass

  7. #7
    johann_p is offline Member
    Join Date
    Jun 2007
    Posts
    19
    Rep Power
    0

    Default

    I have thought of this approach, but it has two disadvantages: as I said, I do not just have two alternatives but several. So the performance will probably be much worse than doing this with a single finite automata compiled from the regular expression. Second, doing it for each alternative seperately will find overlapping matches which need to get sorted out, costing even more performance.

    This all could get avoided by some library that simply implements POSIX-style alternatives matching - I guess I am out of luck here though.

  8. #8
    JosAH's Avatar
    JosAH is offline Moderator
    Join Date
    Sep 2008
    Location
    Voorschoten, the Netherlands
    Posts
    13,765
    Blog Entries
    7
    Rep Power
    21

    Default

    Quote Originally Posted by johann_p View Post
    I have thought of this approach, but it has two disadvantages: as I said, I do not just have two alternatives but several. So the performance will probably be much worse than doing this with a single finite automata compiled from the regular expression. Second, doing it for each alternative seperately will find overlapping matches which need to get sorted out, costing even more performance.
    Can you give an example showing what you mean because in my simplistic mind for alternatives A0, A1, ... An there is no need to try each permutation of alternatives, i.e. A0|A1|A2, A1|A2|A1 etc. you just try each alternative and pick the longest matching alternative ...

    kind regards,

    Jos
    cenosillicaphobia: the fear for an empty beer glass

  9. #9
    johann_p is offline Member
    Join Date
    Jun 2007
    Posts
    19
    Rep Power
    0

    Default

    Well if there is no option to make POSIX-style alternatives matching work with Java regexps and there is no library either, I think i know how to simulate it (pseudocode):
    Java Code:
    curoffset = 0
    while(curoffset < stringlength) {
      for each matcher where matcher.start < curoffset: reset the matcher to find from curoffset
      find the smallest start of all matchers, and for all matchers with the smallest start, find the longest matches
      choose that longest match that has the lowest matcher number
      consume that match and set curroffset to that matcher's end()
    }
    This would reduce the number of times we have to reset the matchers and still avoid problems of the matchers getting out of sync with other matcher's longest matches (and thus matching invalid substrings).

    However the overall effort is of corse still higher than with a single DFA (or even an NFA simulating text-directed DFA matching)

    If anyone *does* know a library that can do POSIX-style alternatives matching or some other apprach, I would still be happy to hear about it.
    Last edited by johann_p; 06-11-2011 at 07:34 PM.

  10. #10
    JosAH's Avatar
    JosAH is offline Moderator
    Join Date
    Sep 2008
    Location
    Voorschoten, the Netherlands
    Posts
    13,765
    Blog Entries
    7
    Rep Power
    21

    Default

    Quote Originally Posted by johann_p View Post
    If anyone *does* know a library that can do POSIX-style alternatives matching or some other apprach, I would still be happy to hear about it.
    Maybe Apache's regexp package can do what you want. Note: I didn't try it and maybe the built-in regexp pacakge can do what you want, I'm just not one of those regexp gurus (there are some of those in Oracle's Java forum).

    kind regards,

    Jos
    cenosillicaphobia: the fear for an empty beer glass

  11. #11
    DarrylBurke's Avatar
    DarrylBurke is offline Forum Police
    Join Date
    Sep 2008
    Location
    Madgaon, Goa, India
    Posts
    11,458
    Rep Power
    20

    Default

    Quote Originally Posted by JosAH View Post
    ... one of those regexp gurus (there are some of those in Oracle's Java forum).
    Just the sharp one, I guess, and I see his posts more often at the ranch than on SnOracle nowadays. Haven't seen your countryman the Greek god for ages though. And if Uncle the super guru's there (I don't think so), he has a different name/number.

    db
    Last edited by DarrylBurke; 06-12-2011 at 06:47 AM.

  12. #12
    JosAH's Avatar
    JosAH is offline Moderator
    Join Date
    Sep 2008
    Location
    Voorschoten, the Netherlands
    Posts
    13,765
    Blog Entries
    7
    Rep Power
    21

    Default

    Quote Originally Posted by DarrylBurke View Post
    Just the sharp one, I guess, and I see his posts more often at the ranch than on SnOracle nowadays. Haven't seen your countryman the Greek god for ages though. And if Uncle the super guru's there (I don't think so), he has a different name/number.

    db
    Bart took my advice on regular expressions (they are overly used and ugly) and nowadays I see him in my mailbox on the ANTLR discussion list ;-)

    kind regards,

    Jos
    cenosillicaphobia: the fear for an empty beer glass

Similar Threads

  1. Help with TreeSet and finding longest string
    By JavaTheHut in forum New To Java
    Replies: 3
    Last Post: 05-04-2011, 11:23 PM
  2. Longest common sequence
    By fam2315 in forum New To Java
    Replies: 8
    Last Post: 04-29-2011, 03:00 AM
  3. Longest word in a program...
    By hustlas4ever in forum New To Java
    Replies: 5
    Last Post: 08-20-2010, 02:34 PM
  4. Replies: 6
    Last Post: 02-21-2010, 01:54 AM
  5. Get all groups from a regular expression match?
    By johann_p in forum New To Java
    Replies: 0
    Last Post: 05-16-2008, 08:50 PM

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •