Java Forums

Old 01-07-2008, 01:46 PM
Member
 
Join Date: Nov 2007
Posts: 23
Java Tutorial is on a distinguished road
Handling regular expressions using Regex
Regular expressions

A regular expression describes a pattern that text may follow, and that pattern is then used to search or process the text. To use regular expressions you need to know their syntax, which is not difficult to learn.

A common use of regular expressions is in search tools: you supply a regular expression, and the search returns the text that matches the pattern.

Java supports regular expressions through the regex API in the java.util.regex package.

Package java.util.regex

The regex API is built around two classes, Pattern and Matcher, plus an exception called PatternSyntaxException.
A Pattern object represents a compiled regular expression. Note that the Pattern class has no public constructor, so you cannot create one with new. Instead, a Pattern object is created by calling the static compile method.

Code:
Pattern pattern = Pattern.compile(regexPattern);
The compile method accepts the regular expression as a String argument, so you have to define the regular expression before creating the Pattern object.

Next step is to create a Matcher object.

A Matcher searches a given text for the regular expression. It does not have a public constructor either. A Matcher object is created by invoking the matcher method on a Pattern object.

Code:
Matcher matcher = pattern.matcher(inputStr);
Now to PatternSyntaxException. It is an unchecked exception thrown by Pattern.compile when the regular expression contains a syntax error.
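To make the exception concrete, here is a minimal sketch that checks whether a regular expression compiles. The class name RegexSyntaxCheck and the helper isValidRegex are my own names for illustration and are not part of the regex API; Pattern.compile, getDescription() and getIndex() are.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class RegexSyntaxCheck {

    // Returns true if the regex compiles, false if Pattern.compile rejects it.
    static boolean isValidRegex(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            // getDescription() and getIndex() report what went wrong and roughly where.
            System.out.println("Bad pattern: " + e.getDescription()
                    + " near index " + e.getIndex());
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidRegex("Hello[0-9]*")); // well-formed pattern
        System.out.println(isValidRegex("Hello[0-9"));   // character class is never closed
    }
}
```

Because the exception is unchecked, the compiler does not force you to catch it. Catching it is mostly useful when the pattern comes from user input.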

A simple example

Time for a basic and simple example.

Code:
// Needs: import java.util.regex.Matcher; import java.util.regex.Pattern;
Pattern pattern = Pattern.compile("Hello[0-9]*");
CharSequence inputStr = "Hello22. I am fine.";
Matcher matcher = pattern.matcher(inputStr);
if (matcher.find()) {
    int start = matcher.start();
    int end = matcher.end();
    System.out.println("Pattern found: " + inputStr.subSequence(start, end).toString());
} else {
    System.out.println("Not found.");
}
Output:
Code:
Pattern found: Hello22
The Pattern object is created with the static compile method, which receives the regular expression as its argument. The CharSequence holds the source text. The Matcher object is created by the matcher(…) method, with the source text as its argument. Calling find() on the matcher then looks for the pattern in the source text. If a match is found, the matched text is printed to the console; otherwise, “Not found.” is printed.

If you believe there can be more than one occurrence of the pattern in the text, you can use a while loop to get all the matches:

Code:
while (matcher.find()) {
    int start = matcher.start();
    int end = matcher.end();
    System.out.println("Pattern found: " + inputStr.subSequence(start, end).toString());
}
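The loop above can be wrapped into a small reusable helper. This is only a sketch: the class FindAllMatches and the method findAll are names I made up, and the input string is an invented variation of the earlier one. It also uses matcher.group(), which returns the same text as the subSequence(start, end) call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindAllMatches {

    // Collects every non-overlapping match of regex in input, in order.
    static List<String> findAll(String regex, CharSequence input) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            // group() is shorthand for input.subSequence(start(), end()).toString()
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // prints [Hello22, Hello7, Hello]
        System.out.println(findAll("Hello[0-9]*", "Hello22. I am fine. Hello7 and Hello"));
    }
}
```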
Regular expression with characters

The simplest form of character class is a set of characters enclosed in square brackets. For example:

Code:
Pattern pattern = Pattern.compile("H[aeo]llo");
The given regular expression matches any word that has ‘a’, ‘e’ or ‘o’ between ‘H’ and ‘llo’. The character class [aeo] matches exactly one occurrence of a, e or o.

Regular expression with Negation

A ^ placed right after the opening bracket negates the character class. Review the example below:
Code:
Pattern pattern = Pattern.compile("H[^aeo]llo");
In this example, the regular expression matches words that have exactly one character other than ‘a’, ‘e’ or ‘o’ between ‘H’ and ‘llo’. Which means:

Hello is not valid.
Hallo is not valid.
Hollo is not valid.
Hbllo is valid.
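The list above can be checked with a short sketch. The class NegationCheck and the method isValid are hypothetical names of mine. The sketch uses matches(), which requires the whole word to fit the pattern, instead of find(), which accepts a match anywhere in the text.

```java
import java.util.regex.Pattern;

public class NegationCheck {

    private static final Pattern NOT_AEO = Pattern.compile("H[^aeo]llo");

    // True only if the entire word fits the pattern.
    static boolean isValid(String word) {
        return NOT_AEO.matcher(word).matches();
    }

    public static void main(String[] args) {
        String[] words = {"Hello", "Hallo", "Hollo", "Hbllo", "H8llo", "Hllo"};
        for (String word : words) {
            System.out.println(word + " -> " + isValid(word));
        }
    }
}
```

Note that H8llo is also accepted, because the negated class allows any character except a, e and o, not only letters. Hllo is rejected because the class must consume exactly one character.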


Regular expression with Ranges

Ranges make character classes more compact. The character ‘-’ denotes a range: [a-z] covers the letters a to z, and [0-9] covers the digits 0 to 9. Let's review an example:

Code:
Pattern pattern = Pattern.compile("H[a-z]llo");
CharSequence inputStr = "Hello H8llo Hjllo Hzllo H_llo";
Matcher matcher = pattern.matcher(inputStr);
while (matcher.find()) {
    int start = matcher.start();
    int end = matcher.end();
    System.out.println("Pattern found: " + inputStr.subSequence(start, end).toString());
}
Output:

Code:
Pattern found: Hello
Pattern found: Hjllo
Pattern found: Hzllo
The range in this regular expression covers all lowercase letters. ‘H8llo’ and ‘H_llo’ do not appear in the output because ‘8’ and ‘_’ fall outside the range.

A range can also be negated. The example below is the same as before, with negation added to the regular expression.

Code:
Pattern pattern = Pattern.compile("H[^a-z]llo");
CharSequence inputStr = "Hello H8llo Hjllo Hzllo H_llo";
Matcher matcher = pattern.matcher(inputStr);
while (matcher.find()) {
    int start = matcher.start();
    int end = matcher.end();
    System.out.println("Pattern found: " + inputStr.subSequence(start, end).toString());
}
Output:

Code:
Pattern found: H8llo
Pattern found: H_llo
Regular expression with Unions
Sometimes you wish to create a regular expression comprising two or more character classes. This is possible by nesting one character class inside another.
Code:
Pattern pattern = Pattern.compile("H[a-e[x-z]]llo");
The above pattern matches all words that have one character (from a to e inclusive, or from x to z inclusive) between ‘H’ and ‘llo’.
Let's take an example:
Code:
Pattern pattern = Pattern.compile("H[a-e[x-z]]llo");
CharSequence inputStr = "Hello H8llo Hxllo Hzllo H_llo";
Matcher matcher = pattern.matcher(inputStr);
while (matcher.find()) {
    int start = matcher.start();
    int end = matcher.end();
    System.out.println("Pattern found: " + inputStr.subSequence(start, end).toString());
}
Output:
Code:
Pattern found: Hello
Pattern found: Hxllo
Pattern found: Hzllo
Regular expression with Intersections
To require that a character belongs to two sets (ranges) at once, use intersection (&&). In contrast to a union, which widens the set of accepted characters, an intersection narrows it.
Code:
Pattern pattern = Pattern.compile("H[a-z&&[aei]]llo");
The above pattern defines two sets. The first is a-z, which includes all the lowercase letters from a to z inclusive. The second set contains three letters: a, e and i. Only a character that falls into both sets satisfies the regular expression.
Time for an example:
Code:
// No spaces around &&: a space inside a character class is matched literally.
Pattern pattern = Pattern.compile("H[a-z&&[aei]]llo");
CharSequence inputStr = "Hello H8llo Hxllo Hzllo H_llo Hillo";
Matcher matcher = pattern.matcher(inputStr);
while (matcher.find()) {
    int start = matcher.start();
    int end = matcher.end();
    System.out.println("Pattern found: " + inputStr.subSequence(start, end).toString());
}
Output:
Code:
Pattern found: Hello
Pattern found: Hillo
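To see the difference between union and intersection side by side, here is a small sketch that runs both kinds of character class over the same input. The class UnionVsIntersection and the helper findAll are names I chose for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnionVsIntersection {

    // Returns every match of regex found in input, in order of appearance.
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "Hello H8llo Hxllo Hzllo H_llo Hillo";
        // Union: the character is in a-e OR in x-z.
        System.out.println(findAll("H[a-e[x-z]]llo", input));   // [Hello, Hxllo, Hzllo]
        // Intersection: the character is in a-z AND is one of a, e, i.
        System.out.println(findAll("H[a-z&&[aei]]llo", input)); // [Hello, Hillo]
    }
}
```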
Regular expression with Intersections and Negation

An interesting combination when constructing regular expressions is intersection with negation. It lets you specify a range and then remove some elements from it.

Code:
Pattern pattern = Pattern.compile("H[a-z&&[^u-z]]llo");
The above pattern matches words that have any lowercase letter (apart from the letters u to z) between ‘H’ and ‘llo’. For better understanding, let's take an example:

Code:
Pattern pattern = Pattern.compile("H[a-z&&[^u-z]]llo");
CharSequence inputStr = "Hallo H2llo Hxllo Hzllo H_llo Hillo";
Matcher matcher = pattern.matcher(inputStr);
while (matcher.find()) {
    int start = matcher.start();
    int end = matcher.end();
    System.out.println("Pattern found: " + inputStr.subSequence(start, end).toString());
}
Output:

Code:
Pattern found: Hallo
Pattern found: Hillo
Useful predefined character classes (shortcuts)

The Pattern API offers several predefined character classes, which make patterns easier to declare. These are referred to as shortcuts:

. Any character
\d A digit - equivalent to [0-9]
\D A non-digit - equivalent to [^0-9]
\s A whitespace character - equivalent to [ \t\n\x0B\f\r]
\S A non-whitespace character - equivalent to [^\s]
\w A word character - equivalent to [a-zA-Z_0-9]
\W A non-word character - equivalent to [^\w]


Note that the upper-case shortcuts (\W, \S and \D) are the negations of their lower-case counterparts.

Code:
Pattern pattern = Pattern.compile("H[a-z]llo\\d");
CharSequence inputStr = "Hallo H2llo Hxllo Hzllo2 H_llo Hillo1";
Matcher matcher = pattern.matcher(inputStr);
while (matcher.find()) {
    int start = matcher.start();
    int end = matcher.end();
    System.out.println("Pattern found: " + inputStr.subSequence(start, end).toString());
}
Output:

Code:
Pattern found: Hzllo2
Pattern found: Hillo1

The following, looser pattern produces the same output for this input. Note that \w is not the same as [a-z] or \d: it accepts any letter, digit or underscore in both positions.

Code:
Pattern pattern = Pattern.compile("H\\wllo\\w");
CharSequence inputStr = "Hallo H2llo Hxllo Hzllo2 H_llo Hillo1";
Matcher matcher = pattern.matcher(inputStr);
while (matcher.find()) {
    int start = matcher.start();
    int end = matcher.end();
    System.out.println("Pattern found: " + inputStr.subSequence(start, end).toString());
}
Output:

Code:
Pattern found: Hzllo2
Pattern found: Hillo1
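The two patterns above give the same output for that particular input, but they are not interchangeable in general. Here is a sketch with an input string I invented to show where they differ. The class StrictVsLoose and the helper findAll are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StrictVsLoose {

    // Returns every match of regex found in input, in order of appearance.
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "Hzllo2 H2llo_ Hillo1";
        // Lowercase letter in the middle, digit at the end.
        System.out.println(findAll("H[a-z]llo\\d", input)); // [Hzllo2, Hillo1]
        // \w accepts letters, digits and underscore in both positions.
        System.out.println(findAll("H\\wllo\\w", input));   // [Hzllo2, H2llo_, Hillo1]
    }
}
```

The backslash is doubled in the source because \ is also the escape character of Java string literals.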
Using Quantifiers in regular expressions

So far, we have worked with regular expressions that define a pattern for one occurrence of a character. Quantifiers specify the number of occurrences to match. They are very useful when declaring patterns in real-life applications. Let's explore this:

X{n} – means exactly n occurrences of X.
X{n,} – means at least n occurrences of X, with no upper limit.
X{n,m} – means at least n and at most m occurrences of X.
X+ – means one or more occurrences of X.
X* – means zero or more occurrences of X.

The following example uses * quantifier to declare the pattern.

Code:
Pattern pattern = Pattern.compile("He*llo");
CharSequence inputStr = "Hallo Hello Heello Hzllo2 H_llo Hillo1";
Matcher matcher = pattern.matcher(inputStr);
while (matcher.find()) {
    int start = matcher.start();
    int end = matcher.end();
    System.out.println("Pattern found: " + inputStr.subSequence(start, end).toString());
}
Output:

Code:
Pattern found: Hello
Pattern found: Heello
Now let's use some other quantifiers for better understanding:

Code:
Pattern pattern = Pattern.compile("He{1,2}l{2}o");
CharSequence inputStr = "Hallo Hello Heello Heeello H_llo Hillo1";
Matcher matcher = pattern.matcher(inputStr);
while (matcher.find()) {
    int start = matcher.start();
    int end = matcher.end();
    System.out.println("Pattern found: " + inputStr.subSequence(start, end).toString());
}
Output:

Code:
Pattern found: Hello
Pattern found: Heello
Notice that ‘Heeello’ is not part of the output: it violates e{1,2} by having 3 ‘e’ characters.
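Here is a small sketch that exercises each quantifier form from the list on whole words. The class QuantifierCheck, the helper whole and the test words are my own choices for illustration. Pattern.matches(regex, input) is a convenience method that compiles the pattern and checks the entire input in one call.

```java
import java.util.regex.Pattern;

public class QuantifierCheck {

    // True only if the whole word fits the regex.
    static boolean whole(String regex, String word) {
        return Pattern.matches(regex, word);
    }

    public static void main(String[] args) {
        System.out.println(whole("He{2}llo", "Heello"));    // true: exactly two e
        System.out.println(whole("He{2}llo", "Hello"));     // false: only one e
        System.out.println(whole("He{2,}llo", "Heeeello")); // true: at least two e
        System.out.println(whole("He{1,2}llo", "Heeello")); // false: three e is too many
        System.out.println(whole("He+llo", "Hllo"));        // false: + needs at least one e
        System.out.println(whole("He*llo", "Hllo"));        // true: * allows zero e
    }
}
```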

Capturing groups using regular expressions

In the examples above, each quantifier applied to a single character, not to a group of characters.

For example, abc{3} matches ‘a’ followed by ‘b’ followed by 3 occurrences of ‘c’.

What if you want to look for 3 occurrences of ‘abc’? In that case, you have to group the characters with parentheses and write the expression as: (abc){3}


Code:
Pattern pattern = Pattern.compile("(Hello){3}");
CharSequence inputStr = "HalloHelloHello HelloHelloHello Heeello H_llo Hillo1";
Matcher matcher = pattern.matcher(inputStr);
while (matcher.find()) {
    int start = matcher.start();
    int end = matcher.end();
    System.out.println("Pattern found: " + inputStr.subSequence(start, end).toString());
}
Output:

Code:
Pattern found: HelloHelloHello
Now let's take a slightly more complex example. The pattern below combines a character class with two quantifiers, one on a single character and one on the whole group. Of course you may use as many quantifiers as you wish.

Code:
Pattern pattern = Pattern.compile("(H[ae]l{2}o){3}");
CharSequence inputStr = "HalloHelloHello HelloHelloHello Heeello H_llo Hillo1";
Matcher matcher = pattern.matcher(inputStr);
while (matcher.find()) {
    int start = matcher.start();
    int end = matcher.end();
    System.out.println("Pattern found: " + inputStr.subSequence(start, end).toString());
}
Output:

Code:
Pattern found: HalloHelloHello
Pattern found: HelloHelloHello
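Parentheses do more than scope a quantifier: the text matched by each group can be read back with matcher.group(n). Here is a minimal sketch under an assumption of mine, namely a made-up input format of a greeting, a dash and a number. The class GroupExtraction and the method extract are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupExtraction {

    // Hypothetical format for illustration: greeting, dash, number (e.g. Hello-42).
    private static final Pattern GREETING = Pattern.compile("(H[ae]llo)-(\\d+)");

    // Returns "greeting=number" for every match found in input.
    static List<String> extract(String input) {
        List<String> pairs = new ArrayList<>();
        Matcher matcher = GREETING.matcher(input);
        while (matcher.find()) {
            // group(0) is the whole match; group(1) and group(2) are the
            // parts matched by the first and second pair of parentheses.
            pairs.add(matcher.group(1) + "=" + matcher.group(2));
        }
        return pairs;
    }

    public static void main(String[] args) {
        System.out.println(extract("Hello-42 Hallo-7 Hillo-9")); // [Hello=42, Hallo=7]
    }
}
```

Groups are numbered by their opening parenthesis, counting from left to right and starting at 1.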
Boundary matchers

Boundary matchers are used to look for a pattern at a specified position, for example at the beginning or the end of a line or word.

^ The beginning of a line
$ The end of a line
\b A word boundary
\B A non-word boundary
\A The beginning of the input
\G The end of the previous match
\Z The end of the input but for the final terminator, if any
\z The end of the input
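A short sketch of a few of these matchers in action. The class BoundaryCheck, the helpers count and found, and the sample strings are all my own inventions for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BoundaryCheck {

    // Counts how many times regex matches inside input.
    static int count(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (matcher.find()) {
            n++;
        }
        return n;
    }

    // True if regex matches anywhere inside input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String text = "cat concat cat.";
        System.out.println(count("cat", text));              // 3: also counts the cat in concat
        System.out.println(count("\\bcat\\b", text));        // 2: whole words only
        System.out.println(found("^Hello", "Hello world"));  // true: input starts with Hello
        System.out.println(found("^Hello", "Say Hello"));    // false: Hello is not at the start
        System.out.println(found("world$", "Hello world"));  // true: input ends with world
    }
}
```

Boundary matchers consume no characters. They only assert something about the position where the match is attempted.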

Case insensitive matching

Sometimes case does not matter and you want to find a pattern regardless of case. For that, pass Pattern.CASE_INSENSITIVE as the second argument of Pattern’s compile method, whose signature is as follows:

static Pattern compile(String regex, int flags)

Review the example below:

Code:
Pattern pattern = Pattern.compile("(H[ae]l{2}o){3}", Pattern.CASE_INSENSITIVE);
CharSequence inputStr = "HALloHelloHEllo HelloHelloHello Heeello H_llo Hillo1";
Matcher matcher = pattern.matcher(inputStr);
while (matcher.find()) {
    int start = matcher.start();
    int end = matcher.end();
    System.out.println("Pattern found: " + inputStr.subSequence(start, end).toString());
}
Output:

Code:
Pattern found: HALloHelloHEllo
Pattern found: HelloHelloHello
Regular expression for email addresses

A common use of regular expressions is to validate an email address. A simplified pattern for this purpose is easy to write. Generally email addresses have the form:

xxx@domain.yyy

Assume that the xxx part may contain only letters (both upper and lower case), digits and underscores, that the domain name may contain only letters (upper and lower case), and that the yyy part may contain only 2 to 3 letters (upper and lower case). The positions of @ and . also matter.

Now review the code below and try to understand the pattern given.

Code:
// + requires at least one character before @; \\. matches a literal dot
// (an unescaped . would match any character).
Pattern pattern = Pattern.compile("[A-Za-z0-9_]+@[A-Za-z]+\\.[A-Za-z]{2,3}");
CharSequence inputStr = "Hi java@java.com java22@java.com java_1@java.com java_1@java.comd java_1@java.de java@java.com java.com java@.com Heeello ";
Matcher matcher = pattern.matcher(inputStr);
while (matcher.find()) {
    int start = matcher.start();
    int end = matcher.end();
    System.out.println("Pattern found: " + inputStr.subSequence(start, end).toString());
}
Output:

Code:
Pattern found: java@java.com
Pattern found: java22@java.com
Pattern found: java_1@java.com
Pattern found: java_1@java.com
Pattern found: java_1@java.de
Pattern found: java@java.com
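Notice that the fourth line of the output comes from java_1@java.comd: find() is satisfied by the java_1@java.com part and ignores the trailing ‘d’. To validate a single address, the whole string should fit the pattern, which is what matches() checks. Here is a minimal sketch of that. The class EmailCheck and the method isValidEmail are hypothetical names, and the pattern only encodes the simplified rules assumed in this post (with a literal dot and a non-empty xxx part, which I take to be the intent), not real-world email syntax.

```java
import java.util.regex.Pattern;

public class EmailCheck {

    // Simplified rule from this post: word characters, @, letters, dot, 2-3 letters.
    private static final Pattern SIMPLE_EMAIL =
            Pattern.compile("[A-Za-z0-9_]+@[A-Za-z]+\\.[A-Za-z]{2,3}");

    // True only if the whole candidate string fits the simplified pattern.
    static boolean isValidEmail(String candidate) {
        return SIMPLE_EMAIL.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        String[] candidates = {"java_1@java.com", "java_1@java.comd", "java@.com", "java@javaXcom"};
        for (String candidate : candidates) {
            System.out.println(candidate + " -> " + isValidEmail(candidate));
        }
    }
}
```

Only the first candidate is accepted. The others fail because of the extra ‘d’, the missing domain name and the missing dot.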
Conclusion

The regex API makes searching text with regular expressions very easy. Studying and understanding the API is fairly easy, and once you have the basics, you can make great use of it. It surely saves time compared to manually parsing strings using String methods.

Happy coding!
Copyright ©2006 - 2007, www.java-forums.org