1) Introduction
Regular Expressions are basically patterns of characters which are used to perform certain useful operations on the given input. The operations include finding particular text, replacing the text with some other text, or validating the given text. For example, we can use Regular Expression to check whether the user input is valid for a field like Email Id or a telephone number.
2) java.util.regex Package
The java.util.regex package provides the classes needed to use regular expressions in a Java application. The package was introduced in Java 1.4. It consists of three main classes, namely,
- Pattern
- Matcher
- PatternSyntaxException
3) Pattern Class
A regular expression, which is specified as a string, must first be compiled into an instance of the Pattern class. The resulting pattern can then be used to create an instance of the Matcher class, which contains various built-in methods that help in performing a match against the regular expression. Many Matcher objects can share the same Pattern object.
Let us now discuss the important methods of the Pattern class.
a) compile() method
Before working with a regular expression, we need its compiled form. We obtain it by calling the Pattern.compile() method, which returns a new Pattern object. Note that compile() is a static method, so we don't need an instance of the Pattern class to call it.
There are two forms of the compile() method,
- compile(String regex)
- compile(String regex, int flags)
In the first form of the compile() method, we pass the regular expression to be compiled. The second form takes an additional parameter that specifies the match flags to apply. The flags include CASE_INSENSITIVE, MULTILINE, DOTALL, UNICODE_CASE and CANON_EQ, and they control how matching is done. Each flag is an int bit mask, so several flags can be combined with the bitwise OR operator (|).
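As a minimal sketch of the second form of compile(), the following example passes one flag and then a combination of two flags. The class name, method names and sample patterns are chosen only for illustration.

```java
import java.util.regex.Pattern;

class CompileFlagsDemo {
    // Compiles "java" with CASE_INSENSITIVE so that letter case is ignored.
    static boolean matchesIgnoringCase(String input) {
        Pattern p = Pattern.compile("java", Pattern.CASE_INSENSITIVE);
        return p.matcher(input).matches();
    }

    // Combines two flags with bitwise OR. DOTALL lets '.' match line terminators too.
    static boolean matchesAcrossLines(String input) {
        Pattern p = Pattern.compile("start.*end", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        return p.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesIgnoringCase("JAVA"));              // true
        System.out.println(matchesIgnoringCase("JavaScript"));        // false
        System.out.println(matchesAcrossLines("START\nmiddle\nEnd")); // true
    }
}
```

Without DOTALL, the last call would return false, because '.' does not match a line terminator by default.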
b) matcher() method
The matcher() method creates a new Matcher object that matches the given input against this pattern. The Matcher can then be used to perform matching operations. The syntax is as follows,
matcher(CharSequence input)
Since String implements CharSequence, a String can be passed directly.
c) matches() method
The Pattern class provides a static matches() method. This method returns true only if the entire input text matches the pattern. Internally, it compiles the pattern with compile() and then creates a matcher with matcher(). The syntax for this static matches() method is,
Pattern.matches(regex, inputSequence);
Let us see a simple example,
RegExpTest.java
import java.util.regex.Pattern;

public class RegExpTest {
    public static void main(String[] args) {
        String inputStr = "Computer";
        String pattern = "Computer";
        // Compiles the pattern and matches the whole input against it.
        boolean patternMatched = Pattern.matches(pattern, inputStr);
        System.out.println(patternMatched);
    }
}
The matches() method returns true in this case, so the output is true. Had we given the input string as "ComputerScience", the matches() method would have returned false.
Whenever we pass a pattern and an input string to the static Pattern.matches() method, the pattern is compiled into a new Pattern object, which is then used for the matching operation. This is inefficient when the same pattern is used repeatedly, because the pattern is recompiled on every call. In that case, it's better to compile the pattern once and use the non-static matches() method of the Matcher class. (This matches() method is discussed with the Matcher class in a later section.)
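The following sketch illustrates the alternative: the pattern is compiled once and reused for every input. The class name, the constant COMPUTER and the helper method are hypothetical names introduced for this example.

```java
import java.util.regex.Pattern;

class PrecompiledPatternDemo {
    // Compiled once and reused. Pattern instances are immutable and safe to share.
    private static final Pattern COMPUTER = Pattern.compile("Computer");

    // Counts the inputs that match the pattern in their entirety.
    static int countExactMatches(String[] inputs) {
        int count = 0;
        for (String input : inputs) {
            // Each input gets its own Matcher, but all share the same Pattern.
            if (COMPUTER.matcher(input).matches()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String[] inputs = {"Computer", "ComputerScience", "Computer"};
        System.out.println(countExactMatches(inputs)); // 2
    }
}
```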
d) pattern() method
This method returns, as a string, the regular expression from which this pattern was compiled.
e) split() method
This method is used to split the given input text based on the given pattern. It returns a String array.
There are two forms of split()
method,
- split(String input)
- split(String input, int limit)
In the second form, the limit argument specifies the maximum number of strings that the split() method returns. The last string contains the rest of the input.
Let us see a simple example for the split() method,
Pattern pattern = Pattern.compile("ing");
String input = "playingrowinglaughingsleepingweeping";
// Split on "ing", returning at most 4 strings.
String[] str = pattern.split(input, 4);
for (String st : str) {
    System.out.println(st);
}
In the above code, we specified 4 as the limit on the number of strings to be returned. Hence we get 4 strings as the result.
The output for the above code is,
play
row
laugh
sleepingweeping
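The effect of the limit argument can be seen more clearly with a zero and a negative value, which the example above does not cover. The sketch below uses a hypothetical helper, splitOnComma(), and an input with trailing delimiters.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitLimitDemo {
    // Splits the input around commas using the given limit.
    static String[] splitOnComma(String input, int limit) {
        return Pattern.compile(",").split(input, limit);
    }

    public static void main(String[] args) {
        String input = "a,b,c,,";
        // limit 0 (same as split(input)): trailing empty strings are dropped.
        System.out.println(Arrays.toString(splitOnComma(input, 0)));  // [a, b, c]
        // Negative limit: no cap, and trailing empty strings are kept.
        System.out.println(Arrays.toString(splitOnComma(input, -1))); // [a, b, c, , ]
        // Positive limit: at most that many strings; the last holds the remainder.
        System.out.println(Arrays.toString(splitOnComma(input, 2)));  // [a, b,c,,]
    }
}
```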
f) flags() method
This method returns the match flags that were specified when the pattern was compiled.
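As a small illustration of the two accessor methods, the sketch below compiles a pattern with two flags and then reads the regular expression and the flags back. The class name, helper names and sample pattern are assumptions made for the example.

```java
import java.util.regex.Pattern;

class PatternInfoDemo {
    // A sample pattern compiled with two flags.
    static Pattern build() {
        return Pattern.compile("[a-z]", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
    }

    // Tests whether a particular flag bit is set on the pattern.
    static boolean hasFlag(Pattern p, int flag) {
        return (p.flags() & flag) != 0;
    }

    public static void main(String[] args) {
        Pattern p = build();
        System.out.println(p.pattern());                            // [a-z]
        System.out.println(hasFlag(p, Pattern.CASE_INSENSITIVE));   // true
        System.out.println(hasFlag(p, Pattern.DOTALL));             // false
    }
}
```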
4) Matcher Class
The Matcher class contains various built-in methods such as matches(), find(), group(), replaceFirst() and replaceAll(). They help us check whether the desired pattern occurs in the given text, search for the pattern in the text, or replace occurrences of the pattern with some other set of characters as required.
Let us now discuss the important methods of the Matcher class.
a) matches() method
The matches() method of the Matcher class matches an input text against a pattern. This method returns true only if the entire input sequence matches the pattern. Consider the following example,
String input = "Java1.4, Java1.5, Java1.6";
Pattern pattern = Pattern.compile("Java");
Matcher matcher = pattern.matcher(input);
boolean patternMatched = matcher.matches();
In this case the value of patternMatched will be false, since the entire input string "Java1.4, Java1.5, Java1.6" does not match the regular expression pattern "Java". Let us use the same Pattern object with another Matcher object and see how it works. Consider the following code,
// Reuses the Pattern object compiled above.
String input1 = "Java";
Matcher matcher1 = pattern.matcher(input1);
boolean patternMatched1 = matcher1.matches();
Here, the matches() method returns true since the entire input sequence matches the pattern "Java". The matches() method is therefore best suited to checking that a whole input conforms to a pattern, as in validation, rather than to searching for words inside a longer text.
Let us see another example to learn about some other methods available in the Matcher class,
String input = "Java1.4, Java1.5, Java1.6";
Pattern pattern = Pattern.compile("Java");
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
    System.out.println(matcher.group() + ": " + matcher.start() + ": " + matcher.end());
}
The output of the above code is,
Java: 0: 4
Java: 9: 13
Java: 18: 22
In the above example code, we see the usage of the find(), group(), start() and end() methods.
Now, let us look at the purpose of those methods.
b) find() method
The find() method of the Matcher class scans the input for the next subsequence that matches the pattern and returns true if one is found.
It has two forms,
- find()
- find(int start)
In our example, we used the first form of the find() method. Each call looks for the next occurrence of the pattern "Java" in the input string "Java1.4, Java1.5, Java1.6", continuing from where the previous match ended, and returns false once no further match exists. This is why the while loop visits all three occurrences.
The second form of the find() method takes an argument that specifies the index at which the search starts. It resets the matcher before searching.
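The second form can be sketched as follows, using the same input string as the example above. The class and the helper method indexOfJavaFrom() are hypothetical names used only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindFromIndexDemo {
    private static final Pattern JAVA = Pattern.compile("Java");

    // Returns the start index of the first match at or after fromIndex, or -1 if none.
    static int indexOfJavaFrom(String input, int fromIndex) {
        Matcher m = JAVA.matcher(input);
        return m.find(fromIndex) ? m.start() : -1;
    }

    public static void main(String[] args) {
        String input = "Java1.4, Java1.5, Java1.6";
        System.out.println(indexOfJavaFrom(input, 0));  // 0
        System.out.println(indexOfJavaFrom(input, 1));  // 9
        System.out.println(indexOfJavaFrom(input, 10)); // 18
        System.out.println(indexOfJavaFrom(input, 19)); // -1
    }
}
```

Note that find(int start) throws an IndexOutOfBoundsException if the index is negative or greater than the length of the input, so callers are expected to pass a valid index.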
c) group() method
The group() method of the Matcher class returns the piece of input that matched the pattern in the previous match operation.
d) start() method and end() method
The start() and end() methods of the Matcher class return, for the most recent match, the index of its first character and the index just past its last character respectively. This is why the first match in the example above is reported as 0 and 4.
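The replaceFirst() and replaceAll() methods of the Matcher class, mentioned at the start of this section, work along the same lines. The following is a minimal sketch; the class name, helper names and the replacement text "JDK" are chosen only for illustration.

```java
import java.util.regex.Pattern;

class ReplaceDemo {
    private static final Pattern JAVA = Pattern.compile("Java");

    // Replaces only the first occurrence of the pattern.
    static String replaceFirstOccurrence(String input, String replacement) {
        return JAVA.matcher(input).replaceFirst(replacement);
    }

    // Replaces every occurrence of the pattern.
    static String replaceEveryOccurrence(String input, String replacement) {
        return JAVA.matcher(input).replaceAll(replacement);
    }

    public static void main(String[] args) {
        String input = "Java1.4, Java1.5";
        System.out.println(replaceFirstOccurrence(input, "JDK")); // JDK1.4, Java1.5
        System.out.println(replaceEveryOccurrence(input, "JDK")); // JDK1.4, JDK1.5
    }
}
```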
5) PatternSyntaxException Class
PatternSyntaxException is an unchecked exception that is thrown when a regular expression pattern contains a syntax error. It has methods such as getDescription(), getIndex(), getMessage() and getPattern(), which enable us to get the details of the error.
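A minimal sketch of catching this exception is shown below. The class name and the helper methods are hypothetical, and the exact wording of the description is left to the library.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class SyntaxErrorDemo {
    // Returns true if the regular expression compiles without a syntax error.
    static boolean isValidRegex(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    // Builds a short report from the details carried by the exception.
    static String describe(String regex) {
        try {
            Pattern.compile(regex);
            return "valid";
        } catch (PatternSyntaxException e) {
            return e.getDescription() + " near index " + e.getIndex()
                    + " in pattern " + e.getPattern();
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidRegex("[a-z]")); // true
        System.out.println(isValidRegex("[a-z"));  // false, the character class is not closed
        System.out.println(describe("[a-z"));
    }
}
```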
We have now seen some very simple examples that cover the basics of regular expressions in Java and the purpose of a few often-used methods. With this basic knowledge, we shall discuss regular expressions in more depth in the following sections.
6) Matching any single character
The '.' character matches any single character (except a line terminator, unless the DOTALL flag is set). For example, the pattern 'ca.' matches inputs such as 'car', 'cat' and 'cap', because each starts with ca followed by exactly one more character. Consider an example to understand this,
Pattern patternObj = Pattern.compile("ca.");
Matcher matcher = patternObj.matcher("cap");
if (matcher.matches()) {
    System.out.println("The given input matched the pattern");
}
The output of the above code is,
The given input matched the pattern
7) Matching Special characters
Suppose we need the '.' character in our pattern to indicate that the input should contain a literal '.' character. In a regular expression, however, '.' has a special meaning. We therefore have to tell the regex engine that we do not mean the metacharacter '.' by escaping it with a '\' (backslash), which is itself a metacharacter. Consider the following example,
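// Assumed pattern for this example: any single character, then "est",
// then a literal '.', then "java".
Pattern patternObj = Pattern.compile(".est\\.java");
Matcher matcher1 = patternObj.matcher("test.java");
Matcher matcher2 = patternObj.matcher("nest.java");
if (matcher1.matches() && matcher2.matches()) {
    System.out.println("Both the inputs matched the pattern");
}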
The output of this code is,
Both the inputs matched the pattern
Hence, '.' matches any character, while '\.' matches the literal '.' character. We can use the backslash in the same way wherever a metacharacter has to be matched literally.
(Note: In a Java String literal, a backslash must itself be escaped with another backslash, so the regular expression \. is written as "\\." in source code.)
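The difference between the escaped and the unescaped dot can be checked with a few inputs, as in the sketch below. It also uses Pattern.quote(), a standard method of the Pattern class that this article does not otherwise cover, which turns a whole string into a literal pattern. The class and helper names are illustrative.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // Thin wrapper around the static Pattern.matches() method.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Unescaped '.' matches any single character.
        System.out.println(matches("a.c", "abc"));                // true
        // "a\\.c" in Java source is the regular expression a\.c (literal dot).
        System.out.println(matches("a\\.c", "abc"));              // false
        System.out.println(matches("a\\.c", "a.c"));              // true
        // Pattern.quote() escapes every metacharacter in the given string.
        System.out.println(matches(Pattern.quote("a.c"), "abc")); // false
        System.out.println(matches(Pattern.quote("a.c"), "a.c")); // true
    }
}
```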
8) Matching particular characters
In a given text, we may need to match specific desired characters. We have already seen that the '.' character matches any single character, but now our requirement is to match only 'c' or 's' along with another desired set of characters. In such situations, we can enclose the desired characters in square brackets [], a metacharacter pair that defines a character set from which exactly one character must appear at that position in the given text. Consider the following piece of code that illustrates the same,
Pattern patternObj = Pattern.compile("[CcSs]at");
Matcher matcher = patternObj.matcher("Sat");
if (matcher.matches()) {
    System.out.println("The given input matched the pattern");
}
The output of this code is,
The given input matched the pattern
In the above example, we have used the pattern '[CcSs]at', wherein Cc is used to match C and c, and Ss matches S and s. Hence this pattern matches the inputs Cat, cat, Sat and sat.
9) Matching range of characters
In regular expressions, we use the metacharacter '-' (a hyphen) inside a character set to specify a range of characters. For example, we can specify the range of lowercase letters as '[a-z]' and the range of uppercase letters as '[A-Z]'.
Consider a situation where the input must start with a digit from 0 to 3, followed by any letter from a to z, followed by another digit from 7 to 9. The following code can be used to validate the input against such a pattern,
Pattern patternObj = Pattern.compile("[0-3][a-z][7-9]");
Matcher matcher = patternObj.matcher("2a8");
if (matcher.matches()) {
    System.out.println("The given input matched the pattern");
}
10) Matching characters apart from a specific list
We can use the '^' metacharacter at the start of a character set to specify one or more characters that we do not expect to match. Let us rewrite the pattern from our previous example using the metacharacter '^',
Pattern patternObj = Pattern.compile("[^3-9][a-z][7-9]");
The '^' character placed first inside [] indicates that the characters specified after it are not expected in the input.
This pattern will match the input "2a8" as in our previous example. Note, however, that the two patterns are not equivalent: [^3-9] rejects 3, which [0-3] accepts, and it accepts any character outside the range 3 to 9, including characters that are not digits.
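To see how a negated set behaves compared with the range used earlier, the two patterns can be tried on a few inputs. This is only a sketch; the class, constant and method names are illustrative.

```java
import java.util.regex.Pattern;

class NegatedClassDemo {
    private static final Pattern RANGE = Pattern.compile("[0-3][a-z][7-9]");
    private static final Pattern NEGATED = Pattern.compile("[^3-9][a-z][7-9]");

    static boolean matchesRange(String input) {
        return RANGE.matcher(input).matches();
    }

    static boolean matchesNegated(String input) {
        return NEGATED.matcher(input).matches();
    }

    public static void main(String[] args) {
        // "2a8" satisfies both patterns.
        System.out.println(matchesRange("2a8") + " " + matchesNegated("2a8")); // true true
        // '3' is inside [0-3] but excluded by [^3-9].
        System.out.println(matchesRange("3a8") + " " + matchesNegated("3a8")); // true false
        // 'x' is not a digit, yet [^3-9] accepts it.
        System.out.println(matchesRange("xa8") + " " + matchesNegated("xa8")); // false true
    }
}
```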
11) Use of other Metacharacters in Regular Expression
We have already discussed the use of the '.' and '^' metacharacters. Let us see the purpose of other metacharacters.
\d
This matches a numeric digit. It is the same as using the character set [0-9].
\D
This matches any character which is non-numeric. It is the same as the use of [^0-9].
\s
This matches a single whitespace character.
\S
This matches any character which is not a whitespace character.
\w
This matches a word character. It is equivalent to the character class [A-Za-z0-9_].
\W
This matches a character that is not a word character. It is equivalent to the negated character class [^A-Za-z0-9_].
[…]
This matches a single character from those present between the square brackets.
[a-f[s-z]]
This specifies the union of two sets of characters, which is the same as [a-fs-z], i.e. it matches any character that is either from a to f or from s to z.
[a-m&&[f-z]]
This specifies the intersection of two sets of characters, which is the same as [f-m], i.e. it matches any character from f to m.
[^…]
This matches any single character, except those characters that are specified inside the square brackets [].
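A few of the entries above can be tried out with the short sketch below. Note that in Java the intersection of two character sets is written with &&. The class name, method names and sample inputs are chosen only for illustration.

```java
import java.util.regex.Pattern;

class CharacterClassDemo {
    // Two digits, one whitespace character, then two word characters.
    static boolean isDigitsSpaceWord(String input) {
        return Pattern.matches("\\d\\d\\s\\w\\w", input);
    }

    // Union: a single character from a to f or from s to z.
    static boolean inUnion(String input) {
        return Pattern.matches("[a-f[s-z]]", input);
    }

    // Intersection: a single character that is in both a-m and f-z, i.e. f to m.
    static boolean inIntersection(String input) {
        return Pattern.matches("[a-m&&[f-z]]", input);
    }

    public static void main(String[] args) {
        System.out.println(isDigitsSpaceWord("42 ok")); // true
        System.out.println(isDigitsSpaceWord("4x ok")); // false
        System.out.println(inUnion("b"));               // true
        System.out.println(inUnion("m"));               // false
        System.out.println(inIntersection("g"));        // true
        System.out.println(inIntersection("b"));        // false
    }
}
```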
12) POSIX Character Classes
The java.util.regex package supports a set of POSIX character classes (for US-ASCII characters only). They are shortcuts that save us from writing out the entire character set in a regular expression.
\p{Lower}
It can be used to match any single lowercase alphabetic character. It is the same as using [a-z].
\p{Upper}
It can be used to match any single uppercase alphabetic character. It is the same as using [A-Z].
\p{Alpha}
It is used to match any alphabetic character. It serves the same purpose as [A-Za-z].
\p{Digit}
It is used to match any single digit. It serves the same purpose as [0-9].
\p{Punct}
It matches a single punctuation character. It is the same as using [!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~].
\p{Graph}
It matches a visible character. It is the same as using [\p{Alnum}\p{Punct}], where \p{Alnum} is the same as [\p{Alpha}\p{Digit}].
\p{Print}
It matches a printable character. It is the same as using [\p{Graph}\x20].
\p{ASCII}
It can be used to match any of the ASCII characters, i.e. U+0000 through U+007F. It is the same as using [\x00-\x7F].
\p{XDigit}
It matches a single hexadecimal digit. It is the same as using [0-9a-fA-F].
\p{Space}
It is used to match a single whitespace character. It is the same as using [ \t\n\x0B\f\r].
\p{Blank}
It matches a single space character or a tab character.
\p{Cntrl}
It matches a control character. It is the same as using [\x00-\x1F\x7F].
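The following sketch tries a few of these classes. As with any backslash in a pattern, \p has to be written as "\\p" inside a Java String literal. The class name, method names and sample inputs are illustrative only.

```java
import java.util.regex.Pattern;

class PosixClassDemo {
    // One uppercase letter, one lowercase letter, then one digit.
    static boolean isUpperLowerDigit(String input) {
        return Pattern.matches("\\p{Upper}\\p{Lower}\\p{Digit}", input);
    }

    // Exactly two hexadecimal digits.
    static boolean isHexPair(String input) {
        return Pattern.matches("\\p{XDigit}\\p{XDigit}", input);
    }

    // A single punctuation character.
    static boolean isPunctuation(String input) {
        return Pattern.matches("\\p{Punct}", input);
    }

    public static void main(String[] args) {
        System.out.println(isUpperLowerDigit("Ab1")); // true
        System.out.println(isUpperLowerDigit("ab1")); // false
        System.out.println(isHexPair("fF"));          // true
        System.out.println(isHexPair("fg"));          // false
        System.out.println(isPunctuation("!"));       // true
        System.out.println(isPunctuation("a"));       // false
    }
}
```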
13) Conclusion
This article is just an introduction to Regular expressions. We have seen the basics on how to use regular expressions to perform useful operations such as search, replace and validation for a given input. Regular Expressions can be effectively used to suit our application needs that involve text manipulation.