Regular expressions

A regular expression describes the common characteristics of a piece of text as a pattern, which can then be used to process that text. You must know the syntax of regular expressions in order to use them, but learning it is not difficult.

A common use of regular expressions is in search tools. You specify a regular expression, and the search is performed based on that pattern.

Java supports regular expressions through the regex API in the java.util.regex package.

Package java.util.regex

The regex API has two main classes, Pattern and Matcher, plus an exception class called PatternSyntaxException.
A Pattern object represents a compiled regular expression. Note that the Pattern class has no public constructor, so you cannot create one with new. Instead, a Pattern object is created with the static compile method.

Java Code:
Pattern pattern = Pattern.compile(regexPattern);
The compile method accepts the regular expression as a String argument, so you have to define the regular expression before creating the Pattern object.

The next step is to create a Matcher object.

A Matcher is used to search for the regular expression in a given text. It does not have a public constructor either. A Matcher object is created by invoking the matcher method on a Pattern object.

Java Code:
Matcher matcher = pattern.matcher(inputStr);
Finally, PatternSyntaxException is an unchecked exception that is thrown when there is a syntax error in a regular expression pattern.
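
To make the exception concrete, here is a minimal sketch that compiles a pattern and reports a syntax error instead of letting it propagate. The class name SyntaxCheckDemo and the helper describeError are made up for this illustration; getDescription() and getIndex() are methods of PatternSyntaxException.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class SyntaxCheckDemo {
    // Hypothetical helper: returns null when the regex compiles,
    // otherwise a short description of the syntax error.
    static String describeError(String regex) {
        try {
            Pattern.compile(regex);
            return null;
        } catch (PatternSyntaxException e) {
            // getDescription() explains the problem, getIndex() gives the
            // approximate position in the pattern (or -1 if unknown).
            return e.getDescription() + " near index " + e.getIndex();
        }
    }

    public static void main(String[] args) {
        System.out.println(describeError("H[aeo]llo")); // valid pattern
        System.out.println(describeError("H[aeo"));     // character class is never closed
    }
}
```

Because the exception is unchecked, the compiler does not force you to catch it. Catching it is mainly worthwhile when the pattern comes from user input.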

A simple example

Time for a basic and simple example.

Java Code:
Pattern pattern = Pattern.compile("Hello[0-9]*");

CharSequence inputStr = "Hello22. I am fine.";
Matcher matcher = pattern.matcher(inputStr);

if (matcher.find()) {
    int start = matcher.start();
    int end = matcher.end();
    System.out.println("Pattern found:  " + inputStr.subSequence(start, end).toString());
} else {
    System.out.println("Not found.");
}
Output:
Java Code:
Pattern found:  Hello22
The Pattern object was created using the static compile method, with the regular expression supplied as its argument. The CharSequence holds the source data. The Matcher object is created by the matcher(…) method, which takes the source data as its argument. With that in place, find() is called on the matcher to look for the pattern in the source data. If the pattern is found, the matched text is printed to the console; otherwise “Not found.” is printed.

If you believe there can be more than one occurrence of the pattern in the text, you can use a while loop to get all the matches:

Java Code:
while (matcher.find()) {
    int start = matcher.start();
    int end = matcher.end();
    System.out.println("Pattern found:  " + inputStr.subSequence(start, end).toString());
}
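
The loop can also be wrapped in a small reusable helper. The sketch below is one possible way to do it, not the only one: the class FindAllDemo and the method findAll are names invented here. It uses matcher.group(), which returns the text of the current match and so replaces the start()/end()/subSequence() combination.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindAllDemo {
    // Hypothetical helper: collects every match of regex in input.
    static List<String> findAll(String regex, CharSequence input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        List<String> matches = new ArrayList<>();
        while (matcher.find()) {
            // group() with no argument is the whole text of the current match
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // prints [Hello22, Hello, Hello7]
        System.out.println(findAll("Hello[0-9]*", "Hello22. I am fine. Hello again, Hello7!"));
    }
}
```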
Regular expression with characters

One of the simplest building blocks of a regular expression is a character class, a set of characters enclosed in square brackets. For example:

Java Code:
Pattern pattern = Pattern.compile("H[aeo]llo");
This regular expression matches every word that has either ‘a’, ‘e’ or ‘o’ between ‘H’ and ‘llo’. The character class [aeo] matches exactly one occurrence of a, e or o.

Regular expression with Negation

Sometimes you want to exclude certain characters instead. Placing ^ directly after the opening bracket negates the character class. Review the example below:

Java Code:
Pattern pattern = Pattern.compile("H[^aeo]llo");
In the example, the regular expression matches words that have exactly one character other than ‘a’, ‘e’ or ‘o’ between ‘H’ and ‘llo’. Which means:

Hello is not valid.
Hallo is not valid.
Hollo is not valid.
Hbllo is valid.
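
The list above can be checked with a short sketch. It uses matches(), which tests the whole input against the pattern, rather than find(). The class name NegationDemo and the helper isValid are assumptions made for this illustration.

```java
import java.util.regex.Pattern;

class NegationDemo {
    private static final Pattern NOT_AEO = Pattern.compile("H[^aeo]llo");

    // Hypothetical helper: true when the whole word matches the pattern.
    static boolean isValid(String word) {
        return NOT_AEO.matcher(word).matches();
    }

    public static void main(String[] args) {
        for (String word : new String[] {"Hello", "Hallo", "Hollo", "Hbllo", "Hllo"}) {
            System.out.println(word + " -> " + isValid(word));
        }
    }
}
```

Note that ‘Hllo’ is rejected as well: a negated class still has to consume exactly one character.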


Regular expression with Ranges

Ranges make character classes much more compact. The character ‘-’ denotes a range: [a-z] covers the lowercase letters from a to z, and [0-9] covers the digits from 0 to 9. Let's review an example:

Java Code:
Pattern pattern = Pattern.compile("H[a-z]llo");

CharSequence inputStr = "Hello H8llo Hjllo Hzllo H_llo";
Matcher matcher = pattern.matcher(inputStr);

while (matcher.find()) {
    int start = matcher.start();
    int end = matcher.end();
    System.out.println("Pattern found:  " + inputStr.subSequence(start, end).toString());
}
Output:

Java Code:
Pattern found:  Hello
Pattern found:  Hjllo
Pattern found:  Hzllo
I used a range in the regular expression that includes all lowercase letters. ‘H8llo’ and ‘H_llo’ were not included in the output because ‘8’ and ‘_’ fall outside that range.

A range can also be negated. I rewrote the example and added negation to the regular expression.

Java Code:
Pattern pattern = Pattern.compile("H[^a-z]llo");

CharSequence inputStr = "Hello H8llo Hjllo Hzllo H_llo";
Matcher matcher = pattern.matcher(inputStr);

while (matcher.find()) {
    int start = matcher.start();
    int end = matcher.end();
    System.out.println("Pattern found:  " + inputStr.subSequence(start, end).toString());
}
Output:

Java Code:
Pattern found:  H8llo
Pattern found:  H_llo
Regular expression with Unions
Sometimes you wish to create a character class that combines two or more character classes. This is possible by nesting one class inside another.
Java Code:
Pattern pattern = Pattern.compile("H[a-e[x-z]]llo");
The above pattern matches all words that have one character (from a to e inclusive, or from x to z inclusive) between ‘H’ and ‘llo’.
Let's take an example:
Java Code:
Pattern pattern = Pattern.compile("H[a-e[x-z]]llo");
CharSequence inputStr = "Hello H8llo Hxllo Hzllo H_llo";
Matcher matcher = pattern.matcher(inputStr);

while (matcher.find()) {
    int start = matcher.start();
    int end = matcher.end();
    System.out.println("Pattern found:  " + inputStr.subSequence(start, end).toString());
}
Output:
Java Code:
Pattern found:  Hello
Pattern found:  Hxllo
Pattern found:  Hzllo
Regular expression with Intersections
To require that a character belongs to two sets (ranges) at once, we use intersection (&&). In contrast to unions, which widen a character class, intersections narrow it.
Java Code:
Pattern pattern = Pattern.compile("H[a-z&&[aei]]llo");
The above pattern defines two sets. The first is a-z, which includes all the lowercase letters from a to z inclusive. The second contains the three letters a, e and i. Only a character that falls into both sets satisfies the character class and is accepted. Note that spaces inside a character class are matched literally, so do not pad the && operator with spaces.
Time for an example:
Java Code:
Pattern pattern = Pattern.compile("H[a-z&&[aei]]llo");

CharSequence inputStr = "Hello H8llo Hxllo Hzllo H_llo Hillo";
Matcher matcher = pattern.matcher(inputStr);

while (matcher.find()) {
    int start = matcher.start();
    int end = matcher.end();
    System.out.println("Pattern found:  " + inputStr.subSequence(start, end).toString());
}
Output:
Java Code:
Pattern found:  Hello
Pattern found:  Hillo
Regular expression with Intersections and Negation

An interesting option when constructing regular expressions is to combine intersection with negation. This lets you specify a range and then remove some elements from it.

Java Code:
Pattern pattern = Pattern.compile("H[a-z&&[^u-z]]llo");
The above pattern matches words that have any lowercase letter (apart from the letters u to z) between ‘H’ and ‘llo’. For better understanding, let's take an example:

Java Code:
Pattern pattern = Pattern.compile("H[a-z&&[^u-z]]llo");

CharSequence inputStr = "Hallo H2llo Hxllo Hzllo H_llo Hillo";
Matcher matcher = pattern.matcher(inputStr);

while (matcher.find()) {
    int start = matcher.start();
    int end = matcher.end();
    System.out.println("Pattern found:  " + inputStr.subSequence(start, end).toString());
}
Output:

Java Code:
Pattern found:  Hallo
Pattern found:  Hillo
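
The three set operations (union, intersection, and intersection with negation) can be compared side by side by testing single characters against a character class. This is only a sketch: the class CharClassSetsDemo and the helper matchesChar are invented for the illustration, while Pattern.matches(regex, input) is the static convenience method from the API.

```java
import java.util.regex.Pattern;

class CharClassSetsDemo {
    // Hypothetical helper: true when the single character c belongs to charClass.
    static boolean matchesChar(String charClass, char c) {
        return Pattern.matches(charClass, String.valueOf(c));
    }

    public static void main(String[] args) {
        // Union: a-e or x-z
        System.out.println(matchesChar("[a-e[x-z]]", 'y'));    // true
        System.out.println(matchesChar("[a-e[x-z]]", 'm'));    // false
        // Intersection: in a-z and also one of a, e, i
        System.out.println(matchesChar("[a-z&&[aei]]", 'e'));  // true
        System.out.println(matchesChar("[a-z&&[aei]]", 'o'));  // false
        // Intersection with negation: a-z without u-z
        System.out.println(matchesChar("[a-z&&[^u-z]]", 't')); // true
        System.out.println(matchesChar("[a-z&&[^u-z]]", 'u')); // false
    }
}
```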
Useful predefined character classes (shortcuts)

The Pattern API offers a number of predefined character classes, which make declaring patterns simpler. They are often referred to as shortcuts:

. Any character
\d A digit - equivalent to [0-9]
\D A non-digit - equivalent to [^0-9]
\s A whitespace character - equivalent to [ \t\n\x0B\f\r]
\S A non-whitespace character - equivalent to [^\s]
\w A word character - equivalent to [a-zA-Z_0-9]
\W A non-word character - equivalent to [^\w]


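Note that the upper-case shortcuts (\W, \S and \D) are the negations of their lower-case counterparts. In a Java string literal the backslash itself must be escaped, so \d is written as "\\d".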

Java Code:
Pattern pattern = Pattern.compile("H[a-z]llo\\d");

CharSequence inputStr = "Hallo H2llo Hxllo Hzllo2 H_llo Hillo1";
Matcher matcher = pattern.matcher(inputStr);

while (matcher.find()) {
    int start = matcher.start();
    int end = matcher.end();
    System.out.println("Pattern found:  " + inputStr.subSequence(start, end).toString());
}
Output:

Java Code:
Pattern found:  Hzllo2
Pattern found:  Hillo1

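For this input, the same result can also be obtained with \w. Note that \w is broader than both [a-z] and \d, since it matches letters, digits and the underscore, so the two patterns are not equivalent in general: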

Java Code:
Pattern pattern = Pattern.compile("H\\wllo\\w");

CharSequence inputStr = "Hallo H2llo Hxllo Hzllo2 H_llo Hillo1";
Matcher matcher = pattern.matcher(inputStr);

while (matcher.find()) {
    int start = matcher.start();
    int end = matcher.end();
    System.out.println("Pattern found:  " + inputStr.subSequence(start, end).toString());
}
Output:

Java Code:
Pattern found:  Hzllo2
Pattern found:  Hillo1
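
The difference between the two patterns above shows up as soon as the input contains digits or underscores in other positions. The following sketch compares them on a few hand-picked words. The class name ShortcutDemo and the helper matches are assumptions of this example.

```java
import java.util.regex.Pattern;

class ShortcutDemo {
    // Hypothetical helper: true when the whole word matches the regex.
    static boolean matches(String regex, String word) {
        return Pattern.matches(regex, word);
    }

    public static void main(String[] args) {
        String strict = "H[a-z]llo\\d"; // lowercase letter, then a digit at the end
        String loose = "H\\wllo\\w";    // any word character in both positions
        for (String word : new String[] {"Hillo1", "H2llo_", "H_lloX"}) {
            System.out.println(word + ": strict=" + matches(strict, word)
                    + ", loose=" + matches(loose, word));
        }
    }
}
```

Only ‘Hillo1’ satisfies the stricter pattern, while all three words satisfy the \w version.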
Using Quantifiers in regular expressions

So far, we have worked with regular expressions that define a pattern for a single occurrence of a character. Quantifiers specify the number of occurrences to match. They are very useful when declaring patterns in real-life applications. Let's explore them:

X{n} – means exactly n occurrences of X.
X{n,} – means at least n occurrences of X and there is no limit for maximum occurrence.
X{n,m} – means at least n and at most m occurrences of X.
X+ – means one or more occurrences of X.
X* – means zero or more occurrences of X.

The following example uses the * quantifier to declare the pattern.

Java Code:
Pattern pattern = Pattern.compile("He*llo");

CharSequence inputStr = "Hallo Hello Heello Hzllo2 H_llo Hillo1";
Matcher matcher = pattern.matcher(inputStr);

while (matcher.find()) {
    int start = matcher.start();
    int end = matcher.end();
    System.out.println("Pattern found:  " + inputStr.subSequence(start, end).toString());
}
Output:

Java Code:
Pattern found:  Hello
Pattern found:  Heello
Now let's use some other quantifiers for better understanding:

Java Code:
Pattern pattern = Pattern.compile("He{1,2}l{2}o");

CharSequence inputStr = "Hallo Hello Heello Heeello H_llo Hillo1";
Matcher matcher = pattern.matcher(inputStr);

while (matcher.find()) {
    int start = matcher.start();
    int end = matcher.end();
    System.out.println("Pattern found:  " + inputStr.subSequence(start, end).toString());
}
Output:

Java Code:
Pattern found:  Hello
Pattern found:  Heello
The above example shows an interesting point. ‘Heeello’ is not part of the output because it violates e{1,2} by having 3 ‘e’ characters.
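
To compare the quantifiers directly, the sketch below tests several of them against the same words, including X+ and X{n,}, which the examples above do not cover. The class name QuantifierDemo and the helper matches are invented for this illustration.

```java
import java.util.regex.Pattern;

class QuantifierDemo {
    // Hypothetical helper: true when the whole word matches the regex.
    static boolean matches(String regex, String word) {
        return Pattern.matches(regex, word);
    }

    public static void main(String[] args) {
        String[] patterns = {"He*llo", "He+llo", "He{2,}llo", "He{1,2}llo"};
        String[] words = {"Hllo", "Hello", "Heello", "Heeeello"};
        for (String regex : patterns) {
            StringBuilder row = new StringBuilder(regex + ":");
            for (String word : words) {
                row.append(' ').append(word).append('=').append(matches(regex, word));
            }
            System.out.println(row);
        }
    }
}
```

For instance, only He*llo accepts ‘Hllo’, and He{1,2}llo is the only pattern of the four that rejects ‘Heeeello’ while accepting ‘Hello’.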

Capturing groups using regular expressions

In the examples above, quantifiers applied to a single character or character class, not to a group of characters.

For example, abc{3} looks for ‘a’ followed by ‘b’ followed by 3 occurrences of ‘c’.

What if you want to look for 3 occurrences of ‘abc’? In that case, you have to enclose the sequence in parentheses to form a group: (abc){3}


Java Code:
Pattern pattern = Pattern.compile("(Hello){3}");

CharSequence inputStr = "HalloHelloHello HelloHelloHello Heeello H_llo Hillo1";
Matcher matcher = pattern.matcher(inputStr);

while (matcher.find()) {
    int start = matcher.start();
    int end = matcher.end();
    System.out.println("Pattern found:  " + inputStr.subSequence(start, end).toString());
}
Output:

Java Code:
Pattern found:  HelloHelloHello
Now let's take a slightly more complex example. In the example below, I have combined a character class and two quantifiers in one pattern. Of course, you may use as many quantifiers as you wish.

Java Code:
Pattern pattern = Pattern.compile("(H[ae]l{2}o){3}");

CharSequence inputStr = "HalloHelloHello HelloHelloHello Heeello H_llo Hillo1";
Matcher matcher = pattern.matcher(inputStr);

while (matcher.find()) {
    int start = matcher.start();
    int end = matcher.end();
    System.out.println("Pattern found:  " + inputStr.subSequence(start, end).toString());
}
Output:

Java Code:
Pattern found:  HalloHelloHello
Pattern found:  HelloHelloHello
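
Parentheses do more than group characters for a quantifier: the text matched by each group can be read back with matcher.group(n), where group 1 is the first opening parenthesis. The following is a minimal sketch of that idea. The pattern, the input, and the names GroupDemo and extract are chosen here purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Hypothetical helper: finds a greeting followed by digits and
    // returns each pair as "greeting=digits".
    static List<String> extract(String input) {
        // group 1: the greeting, group 2: the digits that follow it
        Pattern pattern = Pattern.compile("(H[ae]llo)(\\d+)");
        Matcher matcher = pattern.matcher(input);
        List<String> result = new ArrayList<>();
        while (matcher.find()) {
            result.add(matcher.group(1) + "=" + matcher.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [Hallo=12, Hello=7]
        System.out.println(extract("Hallo12 Hello7 Hillo3"));
    }
}
```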
Boundary matchers

Boundary matchers are used to look for a pattern at a specified position. It is sometimes useful to know whether the pattern occurs at the beginning of the input, at the end, and so on.

^ The beginning of a line
$ The end of a line
\b A word boundary
\B A non-word boundary
\A The beginning of the input
\G The end of the previous match
\Z The end of the input but for the final terminator, if any
\z The end of the input
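
A few of the boundary matchers listed above can be tried out with a small counting helper. This is only a sketch with made-up names (BoundaryDemo, countMatches) and a made-up input string.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryDemo {
    // Hypothetical helper: counts how many times regex is found in input.
    static int countMatches(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "Hello Hellos OhHello Hello";
        System.out.println(countMatches("Hello", text));       // 4: every occurrence
        System.out.println(countMatches("\\bHello\\b", text)); // 2: whole words only
        System.out.println(countMatches("^Hello", text));      // 1: only at the beginning
        System.out.println(countMatches("Hello$", text));      // 1: only at the end
    }
}
```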

Case insensitive matching

Sometimes case does not matter and you want to look for a pattern irrespective of case. For that, supply Pattern.CASE_INSENSITIVE as the second argument of Pattern’s compile method, whose signature is as follows:

static Pattern compile(String regex, int flags)

Review the example below:

Java Code:
Pattern pattern = Pattern.compile("(H[ae]l{2}o){3}",Pattern.CASE_INSENSITIVE);

CharSequence inputStr = "HALloHelloHEllo HelloHelloHello Heeello H_llo Hillo1";
Matcher matcher = pattern.matcher(inputStr);

while (matcher.find()) {
    int start = matcher.start();
    int end = matcher.end();
    System.out.println("Pattern found:  " + inputStr.subSequence(start, end).toString());
}
Output:

Java Code:
Pattern found:  HALloHelloHEllo
Pattern found:  HelloHelloHello
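
As an alternative to the flag argument, the same behaviour can be switched on inside the pattern itself with the embedded flag (?i). The sketch below compares the two forms with a case-sensitive pattern. The class name CaseInsensitiveDemo and the helper found are assumptions of this example.

```java
import java.util.regex.Pattern;

class CaseInsensitiveDemo {
    // Hypothetical helper: true when the pattern occurs anywhere in input.
    static boolean found(Pattern pattern, String input) {
        return pattern.matcher(input).find();
    }

    public static void main(String[] args) {
        Pattern withFlag = Pattern.compile("h[ae]llo", Pattern.CASE_INSENSITIVE);
        Pattern embedded = Pattern.compile("(?i)h[ae]llo"); // same effect, written in the regex
        Pattern sensitive = Pattern.compile("h[ae]llo");

        String input = "Say HELLO";
        System.out.println(found(withFlag, input));  // true
        System.out.println(found(embedded, input));  // true
        System.out.println(found(sensitive, input)); // false
    }
}
```

By default, case-insensitive matching covers only US-ASCII characters. Combine it with Pattern.UNICODE_CASE if other alphabets matter.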
Regular expression for email addresses

A common use of regular expressions is to validate an email address. Writing a simplified regular expression for this purpose is straightforward. Generally, email addresses follow this pattern:

xxx@domain.yyy

Assume that the xxx part may contain only letters (both upper and lower case), digits and underscores, that the domain name may contain only letters (upper and lower case), and that the yyy part may also contain only letters (upper and lower case). It is also assumed that the yyy part has only 2 to 3 letters. The positioning of @ and . is also very important.

Now review the code below and try to understand the pattern given.

Java Code:
// + requires at least one character before @; \\. matches a literal dot
Pattern pattern = Pattern.compile("[A-Za-z0-9_]+@[A-Za-z]+\\.[A-Za-z]{2,3}");

CharSequence inputStr = "Hi java@java.com java22@java.com java_1@java.com "
        + "java_1@java.comd java_1@java.de  java@java.com "
        + "java.com java@.com Heeello ";
Matcher matcher = pattern.matcher(inputStr);

while (matcher.find()) {
    int start = matcher.start();
    int end = matcher.end();
    System.out.println("Pattern found:  " + inputStr.subSequence(start, end).toString());
}
Output:

Java Code:
Pattern found:  java@java.com
Pattern found:  java22@java.com
Pattern found:  java_1@java.com
Pattern found:  java_1@java.com
Pattern found:  java_1@java.de
Pattern found:  java@java.com
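
Notice that ‘java_1@java.comd’ still produced a match in the output above, because find() is satisfied by the substring ‘java_1@java.com’. When the goal is to validate a single address rather than to search a text, matches() is usually the better fit, since it requires the whole input to match. The sketch below illustrates this under the same simplifying assumptions about email addresses. The names EmailCheckDemo and isSimpleEmail are made up, and the pattern is a slightly stricter variant that requires at least one character before @ and escapes the dot.

```java
import java.util.regex.Pattern;

class EmailCheckDemo {
    // Simplified pattern for illustration only; real addresses allow more.
    private static final Pattern SIMPLE_EMAIL =
            Pattern.compile("[A-Za-z0-9_]+@[A-Za-z]+\\.[A-Za-z]{2,3}");

    // Hypothetical helper: true when the whole candidate matches the pattern.
    static boolean isSimpleEmail(String candidate) {
        return SIMPLE_EMAIL.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        String[] candidates = {"java_1@java.com", "java_1@java.comd", "java@.com", "@java.com"};
        for (String candidate : candidates) {
            System.out.println(candidate + " -> " + isSimpleEmail(candidate));
        }
    }
}
```

Only the first candidate is accepted; the other three are rejected.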
Conclusion

The regex API makes searching text with regular expressions very easy. Studying and understanding the API is fairly easy, and once you have the basics, you can make great use of it. It surely saves time compared to manually parsing strings using String functions.

Happy coding!